Preface

This module introduces the Connexions online textbook "Collaborative 
Statistics Using R" by Ananda Mahto based on "Collaborative Statistics" by 
Barbara Illowsky and Susan Dean. 


About "Collaborative Statistics" 


Collaborative Statistics was written by Barbara Illowsky and Susan Dean, 
faculty members at De Anza College in Cupertino, California. 


The original preface to the book, as written by Professors Illowsky and
Dean, follows:


This book is intended for introductory statistics courses being taken by 
students at two- and four-year colleges who are majoring in fields other
than math or engineering. Intermediate algebra is the only prerequisite. The 
book focuses on applications of statistical knowledge rather than the theory 
behind it. The text is named Collaborative Statistics because students learn 
best by doing. In fact, they learn best by working in small groups. The old 
saying “two heads are better than one” truly applies here. 


Our emphasis in this text is on four main concepts: 


• thinking statistically
• incorporating technology
• working collaboratively
• writing thoughtfully


These concepts are integral to our course. Students learn best by
actively participating, not by just watching and listening. Teaching should 
be highly interactive. Students need to be thoroughly engaged in the 
learning process in order to make sense of statistical concepts. 
Collaborative Statistics provides techniques for students to write across the 
curriculum, to collaborate with their peers, to think statistically, and to 
incorporate technology. 


This book takes students step by step. The text is interactive. Therefore, 
students can immediately apply what they read. Once students have
completed the process of problem solving, they can tackle interesting and
challenging problems relevant to today’s world. The problems require the 
students to apply their newly found skills. The book also contains labs that 
use real data and practices that lead students step by step through the 
problem solving process. 


About this custom edition of Collaborative Statistics 


This custom edition of Collaborative Statistics is designed for use in a short 
course in introductory statistics. Additionally, the text includes examples of 
how to use the R-project open-source statistical package for the 
calculations. 


R software was chosen for several reasons. First, it is free. Second, it is 
relatively easy to learn once you actually start using it. Third, the software 
is stable and quite advanced; there are many features implemented in R that 
are not found in commercial software packages. Fourth, R has great 
community support; if there are any questions you might have, there are 
numerous user-groups which can help you solve your problems. 


We hope that you enjoy the process of learning about statistics and 
simultaneously learning how to use R. 


R code listings 


Code listings in this text have been formatted in a way to make it easy to 
copy and paste from this document into an R script or an R session. All 
comments are preceded by a single hash symbol (#), and all sample output 
is preceded by two hash symbols (##). These are seen as comments by the 
R software, so will not be processed when you run your code. 


Introduction 
This module provides a brief introduction to the field of statistics, including
examples of how its topics show up in a variety of real-life situations.


Student Learning Objectives 
By the end of this chapter, the student should be able to: 


• Recognize and differentiate between key terms.
• Apply various types of sampling methods to data collection.
• Create and interpret frequency tables.
• Apply some basic functions in R to generate samples and to
summarize data.


The R functions you will be using in this chapter are cumsum(), cut(),
length(), sample(), set.seed(), and table(). 


Introduction 


You are probably asking yourself the question, "When and where will I use
statistics?" If you read any newspaper, watch television, or use the
Internet, you will see statistical information. There are statistics about
crime, sports, education, politics, and real estate. Typically, when you read a 
newspaper article or watch a news program on television, you are given 
sample information. With this information, you may make a decision about 
the correctness of a statement, claim, or "fact." Statistical methods can help 
you make the "best educated guess." 


Since you will undoubtedly be given statistical information at some point in 
your life, you need to know some techniques to analyze the information 
thoughtfully. Think about buying a house or managing a budget. Think 
about your chosen profession. The fields of economics, business, 
psychology, education, biology, law, computer science, police science, and 
early childhood development require at least a basic understanding of 
Statistics. 


Included in this chapter are the basic ideas of statistics. You will also learn 
how data are gathered and what "good" data are. Additionally, you will be 
introduced to some very basic functions in R to help you work more 
efficiently. 


Statistics 

This module introduces the concept of statistics, specifically the ability to 
use Statistics to describe data (descriptive statistics) as well as draw 
conclusions (inferential statistics). An optional classroom exercise is 
included. 


The science of statistics deals with the collection, analysis, interpretation, 
and presentation of data. We see and use data in our everyday lives. 


Optional Collaborative Classroom Exercise 


In your classroom, try this exercise. Have class members write down the 
average time (in hours, to the nearest half-hour) they sleep per night. Your 
instructor will record the data. Then create a simple graph (called a dot 
plot) of the data. A dot plot consists of a number line and dots (or points) 
positioned above the number line. For example, consider the following 
data: 


5; 5.5; 6; 6; 6; 6.5; 6.5; 6.5; 6.5; 7; 7; 8; 8; 9


The dot plot for these data would be as follows:

[Figure: dot plot, "Frequency of Average Time (in Hours) Spent Sleeping per Night"]

Does your dot plot look the same as or different from the example? Why? If
you did the same example in an English class with the same number of 
students, do you think the results would be the same? Why or why not? 


Where do your data appear to cluster? How could you interpret the 
clustering? 


The questions above ask you to analyze and interpret your data. With this 
example, you have begun your study of statistics. 


In this course, you will learn how to organize and summarize data. 
Organizing and summarizing data is called descriptive statistics. Two ways 
to summarize data are by graphing and by numbers (for example, finding an 
average). After you have studied probability and probability distributions, 
you will use formal methods for drawing conclusions from "good" data. 
The formal methods are called inferential statistics. Statistical inference 
uses probability to determine how confident we can be that the conclusions 
are correct. 


Effective interpretation of data (inference) is based on good procedures for 
producing data and thoughtful examination of the data. You will encounter 
what will seem to be too many mathematical formulas for interpreting data. 
The goal of statistics is not to perform numerous calculations using the 
formulas, but to gain an understanding of your data. The calculations can be 
done using a calculator or a computer. The understanding must come from 
you. If you can thoroughly grasp the basics of statistics, you can be more 
confident in the decisions you make in life. 


Creating dot plots in R


A dot plot is a very basic graph that can quickly show you how your data
are distributed, and it is useful with small datasets. In R, a dot plot is
referred to as a strip chart, and is plotted using the stripchart()
function.


The stripchart() function over-plots points on top of each other by 
default. To override this behavior, optional arguments such as method and 
offset are added to the R command. 


hours.sleep = c(5, 5.5, 6, 6, 6, 6.5, 6.5,
                6.5, 6.5, 7, 7, 8, 8, 9)


stripchart(hours.sleep, method = "stack", 
offset = 1, frame.plot = FALSE, 
at = .25)


Glossary


Data 
A set of observations (a set of possible outcomes). Most data can be 
put into two groups: qualitative (hair color, ethnic groups and other 
attributes of the population) and quantitative (distance traveled to 
college, number of children in a family, etc.). Quantitative data can be 
separated into two subgroups: discrete and continuous. Data is 
discrete if it is the result of counting (the number of students of a given 
ethnic group in a class, the number of books on a shelf, etc.). Data is 
continuous if it is the result of measuring (distance traveled, weight of 
luggage, etc.) 


Statistic 
A numerical characteristic of the sample. A statistic estimates the 
corresponding population parameter. For example, the average number 
of full-time students in a 7:30 a.m. class for this term (statistic) is an
estimate for the average number of full-time students in any class this
term (parameter). 


Probability 
This module introduces the concept of probability as a mathematical 
measure of randomness, including a number of real-world applications. 


Probability is a mathematical tool used to study randomness. It deals with 
the chance (the likelihood) of an event occurring. For example, if you toss a 
fair coin 4 times, the outcomes may not be 2 heads and 2 tails. However, if 
you toss the same coin 4,000 times, the outcomes will be close to half heads 
and half tails. The expected theoretical probability of heads in any one toss
is 1/2 or 0.5. Even though the outcomes of a few repetitions are uncertain,
there is a regular pattern of outcomes when there are many repetitions.
After reading about the English statistician Karl Pearson, who tossed a coin
24,000 times with a result of 12,012 heads, one of the authors tossed a coin
2,000 times. The results were 996 heads. The fraction 996/2,000 is equal to
0.498, which is very close to 0.5, the expected probability.
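
The long-run pattern described above can be explored with a short simulation. The following is only an illustrative sketch, not a record of the authors' experiment: the seed and the variable names are arbitrary choices, and the exact proportions you see depend on the seed and on your version of R.

```r
# Simulate tossing a fair coin: 4 tosses versus 4,000 tosses
coin = c("H", "T")
set.seed(2000)  # arbitrary seed, used only so the run can be repeated

few.tosses  = sample(coin, 4, replace = TRUE)
many.tosses = sample(coin, 4000, replace = TRUE)

# Proportion of heads and tails in each set of tosses
table(few.tosses) / length(few.tosses)
table(many.tosses) / length(many.tosses)
```

With only 4 tosses the proportions can be far from 0.5; with 4,000 tosses they should be close to it.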


The theory of probability began with the study of games of chance such as 
poker. Predictions take the form of probabilities. To predict the likelihood 
of an earthquake, of rain, or whether you will get an A in this course, we 
use probabilities. Doctors use probability to determine the chance of a 
vaccination causing the disease the vaccination is supposed to prevent. A 
stockbroker uses probability to determine the rate of return on a client's 
investments. You might use probability to decide to buy a lottery ticket or 
not. In your study of statistics, you will use the power of mathematics 
through probability calculations to analyze and interpret your data. 


Glossary 


Probability 
A number between 0 and 1, inclusive, that gives the likelihood that a 
specific event will occur. The foundation of statistics is given by the
following 3 axioms (by A. N. Kolmogorov, 1930s): Let S denote the
sample space and let A and B be two events in S. Then:


• 0 ≤ P(A) ≤ 1.
• If A and B are any two mutually exclusive events, then
P(A or B) = P(A) + P(B).
• P(S) = 1.


Key Terms 
This module introduces a number of key terms related to statistical 
sampling and data. 


In statistics, we generally want to study a population. You can think of a 
population as an entire collection of persons, things, or objects under study. 
To study the larger population, we select a sample. The idea of sampling is 
to select a portion (or subset) of the larger population and study that portion 
(the sample) to gain information about the population. Data are the result of 
sampling from a population. 


Because it takes a lot of time and money to examine an entire population, 
sampling is a very practical technique. If you wished to compute the overall 
grade point average at your school, it would make sense to select a sample 
of students who attend the school. The data collected from the sample 
would be the students' grade point averages. In presidential elections, 
opinion poll samples of 1,000 to 2,000 people are taken. The opinion poll is 
supposed to represent the views of the people in the entire country. 
Manufacturers of canned carbonated drinks take samples to determine if a 
16 ounce can contains 16 ounces of carbonated drink. 


From the sample data, we can calculate a statistic. A statistic is a number 
that is a property of the sample. For example, if we consider one math class 
to be a sample of the population of all math classes, then the average 
number of points earned by students in that one math class at the end of the 
term is an example of a statistic. The statistic is an estimate of a population 
parameter. A parameter is a number that is a property of the population. 
Since we considered all math classes to be the population, then the average 
number of points earned per student over all the math classes is an example 
of a parameter. 
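
To make the distinction concrete, the sketch below builds a small, entirely hypothetical population of point totals (the values are simulated, not taken from any real class) and compares the population average, a parameter, with the average of one randomly chosen class of 30 students, a statistic. The population size, score range, and seed are all assumptions made for illustration.

```r
# Hypothetical population: point totals for 1,000 math students
# (made-up values between 40 and 100, for illustration only)
set.seed(1)  # arbitrary seed
population = sample(40:100, 1000, replace = TRUE)

# One class of 30 students drawn from that population
one.class = sample(population, 30)

# Parameter: the average over the whole population
parameter = sum(population) / length(population)

# Statistic: the average over the sample
statistic = sum(one.class) / length(one.class)

parameter
statistic
```

The two numbers will usually be close but not identical, which is why the statistic is described as an estimate of the parameter.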


One of the main concerns in the field of statistics is how accurately a
statistic estimates a parameter. The accuracy depends on how well the
sample represents the population. The sample must contain the 
characteristics of the population in order to be a representative sample. We 
are interested in both the sample statistic and the population parameter in 
inferential statistics. In a later chapter, we will use the sample statistic to 
test the validity of the established population parameter. 


A variable, notated by capital letters like X and Y, is a characteristic of 
interest for each person or thing in a population. Variables may be 
numerical or categorical. Numerical variables take on values with equal 
units such as weight in pounds and time in hours. Categorical variables 
place the person or thing into a category. If we let X equal the number of 
points earned by one math student at the end of a term, then X is a
numerical variable. If we let Y be a person's party affiliation, then examples 
of Y include Republican, Democrat, and Independent. Y is a categorical 
variable. We could do some math with values of X (calculate the average 
number of points earned, for example), but it makes no sense to do math 
with values of Y (calculating an average party affiliation makes no sense). 


Data are the actual values of the variable. They may be numbers or they 
may be words. Datum is a single value. 


Two words that come up often in statistics are average and proportion. If
you were to take three exams in your math classes and obtained scores of
86, 75, and 92, you would calculate your average score by adding the three exam
scores and dividing by three (your average score would be 84.3 to one
decimal place). If, in your math class, there are 40 students and 22 are men
and 18 are women, then the proportion of men students is 22/40 and the
proportion of women students is 18/40. Average and proportion are discussed
in more detail in later chapters.
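
The arithmetic in this paragraph can be checked in R. This is a minimal sketch using the numbers from the text; the variable names are chosen here for illustration.

```r
# Average of the three exam scores
scores = c(86, 75, 92)
average = sum(scores) / length(scores)
round(average, 1)
## [1] 84.3

# Proportions of men and women in a class of 40
class.size = 40
men = 22
women = 18
men / class.size
## [1] 0.55
women / class.size
## [1] 0.45
```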


Example: 
Exercise: 


Problem: 


Define the key terms from the following study: We want to know the 
average amount of money first year college students spend at ABC 
College on school supplies that do not include books. We randomly 
survey 100 first year students at the college. Three of those students 
spent $150, $200, and $225, respectively. 


Solution: 


The population is all first year students attending ABC College this 
term. 


The sample could be all students enrolled in one section of a 
beginning statistics course at ABC College (although this sample may 
not represent the entire population). 


The parameter is the average amount of money spent (excluding 
books) by first year college students at ABC College this term. 


The statistic is the average amount of money spent (excluding books) 
by first year college students in the sample. 


The variable could be the amount of money spent (excluding books) 
by one first year student. Let X = the amount of money spent 
(excluding books) by one first year student attending ABC College. 


The data are the dollar amounts spent by the first year students. 
Examples of the data are $150, $200, and $225. 


Glossary 


Average
A number that describes the central tendency of the data. There are a
number of specialized averages, including the arithmetic mean,
weighted mean, median, mode, and geometric mean.


Data
A set of observations (a set of possible outcomes). Most data can be
put into two groups: qualitative (hair color, ethnic groups and other
attributes of the population) and quantitative (distance traveled to
college, number of children in a family, etc.). Quantitative data can be
separated into two subgroups: discrete and continuous. Data is
discrete if it is the result of counting (the number of students of a given
ethnic group in a class, the number of books on a shelf, etc.). Data is
continuous if it is the result of measuring (distance traveled, weight of
luggage, etc.).


Proportion 


• As a number: A proportion is the number of successes divided by
the total number in the sample.

• As a probability distribution: Given a binomial random variable
(RV), X ~ B(n, p), consider the ratio of the number X of
successes in n Bernoulli trials to the number n of trials: P' = X/n.
This new RV is called a proportion, and if the number of trials, n,
is large enough, P' ~ N(p, sqrt(pq/n)), where q = 1 - p and the second
argument is the standard deviation.


Data 

This module introduces the concepts of qualitative data, quantitative 
continuous data, and quantitative discrete data as used in statistics. Sample 
problems are included. 


Data may come from a population or from a sample. Small letters like x or 
y generally are used to represent data values. Most data can be put into the 
following categories: 


• Qualitative
• Quantitative


Qualitative data are the result of categorizing or describing attributes of a 
population. Hair color, blood type, ethnic group, the car a person drives, and 
the street a person lives on are examples of qualitative data. Qualitative data 
are generally described by words or letters. For instance, hair color might 
be black, dark brown, light brown, blonde, gray, or red. Blood type might 
be AB+, O-, or B+. Qualitative data are not as widely used as quantitative 
data because many numerical techniques do not apply to the qualitative 
data. For example, it does not make sense to find an average hair color or 
blood type. 


Quantitative data are always numbers and are usually the data of choice 
because there are many methods available for analyzing the data. 
Quantitative data are the result of counting or measuring attributes of a 
population. Amount of money, pulse rate, weight, number of people living 
in your town, and the number of students who take statistics are examples 
of quantitative data. Quantitative data may be either discrete or 
continuous. 


All data that are the result of counting are called quantitative discrete 
data. These data take on only certain numerical values. If you count the 
number of phone calls you receive for each day of the week, you might get 
0, 1, 2, 3, etc. 


All data that are the result of measuring are quantitative continuous data,
assuming that we can measure accurately. Measuring angles in radians
might result in the numbers π/6, π/3, π/2, π, 3π/4, etc. If you and your friends
carry backpacks with books in them to school, the numbers of books in the
backpacks are discrete data and the weights of the backpacks are continuous
data.


Example: 

Data Sample of Quantitative Discrete Data 

The data are the number of books students carry in their backpacks. You 
sample five students. Two students carry 3 books, one student carries 4 
books, one student carries 2 books, and one student carries 1 book. The 
numbers of books (3, 4, 2, and 1) are the quantitative discrete data. 


Example: 

Data Sample of Quantitative Continuous Data 

The data are the weights of the backpacks with the books in them. You sample
the same five students. The weights (in pounds) of their backpacks are 6.2, 
7, 6.8, 9.1, 4.3. Notice that backpacks carrying three books can have 
different weights. Weights are quantitative continuous data because 
weights are measured. 


Example: 

Data Sample of Qualitative Data 

The data are the colors of backpacks. Again, you sample the same five 
students. One student has a red backpack, two students have black 
backpacks, one student has a green backpack, and one student has a gray 
backpack. The colors red, black, black, green, and gray are qualitative data. 
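
The three backpack examples can be entered in R as follows. This is a sketch for illustration: the variable names are chosen here, and the order of values within each vector is arbitrary because the text does not say which student carried which backpack.

```r
# Data from the three backpack examples (five students)
books   = c(3, 3, 4, 2, 1)            # quantitative discrete (counted)
weights = c(6.2, 7, 6.8, 9.1, 4.3)    # quantitative continuous (measured)
colors  = c("red", "black", "black", "green", "gray")  # qualitative

# Frequency tables summarize discrete and qualitative data
table(books)
table(colors)

# Arithmetic such as an average makes sense only for quantitative data
sum(weights) / length(weights)
```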


Note: You may collect data as numbers and report them categorically. For
example, the quiz scores for each student are recorded throughout the term.
At the end of the term, the quiz scores are reported as A, B, C, D, or F.
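
One way to turn numbers into categories in R is the cut() function. The sketch below uses hypothetical quiz scores and assumed grade boundaries (90, 80, 70, and 60); neither comes from the text, and a real course may use different cutoffs.

```r
# Hypothetical quiz scores (out of 100) for eight students
quiz.scores = c(95, 82, 67, 74, 58, 88, 91, 79)

# Assumed grade boundaries: below 60 is F, 60-69 is D, 70-79 is C,
# 80-89 is B, and 90-100 is A
grades = cut(quiz.scores,
             breaks = c(0, 60, 70, 80, 90, 100),
             labels = c("F", "D", "C", "B", "A"),
             right = FALSE, include.lowest = TRUE)

# Frequency table of the letter grades
table(grades)
```

Setting right = FALSE makes each interval include its lower boundary, so a score of exactly 90 is reported as an A rather than a B.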


Glossary 


Continuous Random Variable 
A random variable (RV) whose outcomes are measured. 


Example: 
The height of trees in the forest is a continuous RV. 


Data 
A set of observations (a set of possible outcomes). Most data can be
put into two groups: qualitative (hair color, ethnic groups, and many
other attributes of the population) and quantitative (distance traveled to
college, number of children in a family, etc.). Quantitative
data can in turn be separated into two subgroups: discrete and continuous.
Roughly speaking, data is discrete if it is the result of counting (the number
of students of a given ethnic group in a class, the number of books on a
shelf, etc.), and data is continuous if it is the result of measuring (distance
traveled, weight of luggage, etc.).


Discrete Random Variable 
A random variable (RV) whose outcomes are counted. 


Qualitative Data 
See Data. 


Quantitative Data 
See Data. 


Sampling 

This module introduces the concept of statistical sampling. Students are 
taught the difference between a simple random sample, stratified sample, 
cluster sample, systematic sample, and convenience sample. Example 
problems are provided, including an optional classroom activity. 


Gathering information about an entire population often costs too much or is 
virtually impossible. Instead, we use a sample of the population. A sample 
should have the same characteristics as the population it is 
representing. Most statisticians use various methods of random sampling 
in an attempt to achieve this goal. This section will describe a few of the 
most common methods. 


There are several different methods of random sampling. In each form of 
random sampling, each member of a population initially has an equal 
chance of being selected for the sample. Each method has pros and cons. 
The easiest method to describe is called a simple random sample. Two 
simple random samples contain members equally representative of the 
entire population. In other words, each sample of the same size has an equal 
chance of being selected. For example, suppose Lisa wants to form a four- 
person study group (herself and three other people) from her pre-calculus 
class, which has 33 members including Lisa. To choose a simple random 
sample of size 3 from the other members of her class, Lisa could put all the 
other 32 names in a hat, shake the hat, close her eyes, and pick out 3 names. 
A more technological way is for Lisa to first list the names of her 
classmates together with a two-digit number as shown below. 


ID  Name        ID  Name        ID  Name
01  Anselmo     12  Larry       23  Rowell
02  Bayani      13  Lizzy       24  Salangsang
03  Cheng       14  Macierz     25  Slade
04  Cuarismo    15  Motogawa    26  Stracher
05  Cuningham   16  Okimoto     27  Tallai
06  Fontecha    17  Patel       28  Tran
07  Hong        18  Price       29  Wai
08  Hoobler     19  Quizon      30  Wood
09  Jiao        20  Reyes       31  Yogi
10  Khan        21  Roquero     32  Zoe
11  King        22  Roth


Class Roster 


Lisa can either use a table of random numbers (found in many statistics 
books as well as mathematical handbooks) or a calculator or computer to 
generate random numbers. For this example, suppose Lisa chooses to 
generate random numbers by using R. She enters the statement 
sample(32, 3), where 32 is the number of students in the class
(excluding herself), and 3 is the number of students she wants to select. If she wants
her random sample to be replicable, she needs to set a seed for her sample
by using the set.seed() function, as demonstrated in the second example. When
you try this exercise, you should get different results if you are not using a 
seed value or if you are using a different seed value from the one in the 
example code. 


Note: Use set.seed() whenever you want to be able to reproduce your results.
You can, for instance, set the seed to the date on which you first ran your
experiment. For example, if your first experiment was done on August 1, 2010,
you might write 20100801 for your seed. Every time you re-run your
experiment, use the same date from your original experiment as your seed,
and your output will be the same.


# A random sample of 3 from 32 

sample(32, 3) 

## [1] 7 10 32 

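# Using set.seed() to get a reproducible sample
# The seed can be any number you want
# (The output shown was produced with a version of R older than 3.6.0;
# newer versions use a different default sampling algorithm, so the
# same seed returns different numbers.)
set.seed(123)

sample(32, 3)

## [1] 10 25 13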


Using this information, Lisa will select the students with the ID numbers 
generated by R. 


Sometimes, it is difficult or impossible to obtain a simple random sample 
because populations are too large. Then we choose other forms of sampling 
methods that involve a chance process for getting the sample. Other well- 
known random sampling methods are the stratified sample, the cluster 
sample, and the systematic sample. 


To choose a stratified sample, divide the population into groups called 
strata and then take a sample from each stratum. For example, you could 
stratify (group) your college population by department and then choose a 
simple random sample from each stratum (each department) to get a 
stratified random sample. To choose a simple random sample from each 
department, number each member of the first department, number each 
member of the second department and do the same for the remaining
departments. Then use simple random sampling to choose numbers from
the first department and do the same for each of the remaining departments. 
Those numbers picked from the first department, picked from the second 
department and so on represent the members who make up the stratified 
sample. 
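
The steps above can be sketched in R. The department names, their sizes, the sample size of 5 per department, and the seed are all hypothetical values chosen for illustration.

```r
# Hypothetical department sizes (assumed for illustration)
n.biology = 50
n.history = 30
n.math    = 20

set.seed(20100801)  # arbitrary seed, used only for reproducibility

# Members are numbered 1 to n within each department (stratum);
# take a simple random sample of 5 numbers from each one
biology.sample = sample(n.biology, 5)
history.sample = sample(n.history, 5)
math.sample    = sample(n.math, 5)

biology.sample
history.sample
math.sample
```

Together, the members with these numbers in each department would make up the stratified sample.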


To choose a cluster sample, divide the population into groups (clusters) and
then randomly select some of the clusters. All the members from these clusters
are in the cluster sample. For example, if you randomly sample four departments
from your college population, the four departments make up the
cluster sample. You could do this by numbering the different departments 
and then choose four different numbers using simple random sampling. All 
members of the four departments with those numbers are the cluster 
sample. 
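
A cluster sample can be sketched in R in the same spirit. The roster below is hypothetical: it assumes 12 numbered departments with 10 students each, which is not a figure from the text.

```r
# Hypothetical roster: 120 students, 10 in each of 12 numbered departments
student.id = 1:120
department = rep(1:12, each = 10)

set.seed(20100801)  # arbitrary seed, used only for reproducibility

# Randomly choose four departments
chosen = sample(12, 4)
chosen

# Every student in a chosen department is in the cluster sample
cluster.sample = student.id[department %in% chosen]
length(cluster.sample)
## [1] 40
```

Whichever departments are chosen, the sample here always contains 40 students, because every department in this made-up roster has the same size.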


To choose a systematic sample, randomly select a starting point and take 
every nth piece of data from a listing of the population. For example, 
suppose you have to do a phone survey. Your phone book contains 20,000 
residence listings. You must choose 400 names for the sample. Number the 
population 1 - 20,000 and then use a simple random sample to pick a 
number that represents the first name of the sample. Then choose every 
50th name thereafter until you have a total of 400 names (you might have to 
go back to the beginning of your phone list). Systematic sampling is frequently chosen
because it is a simple method. 
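
The phone survey example can be sketched in R as follows. The variable names and the seed are choices made here for illustration; the population size, sample size, and step of 50 come from the example.

```r
population.size = 20000
sample.size     = 400
step            = population.size / sample.size  # every 50th listing

set.seed(20100801)  # arbitrary seed, used only for reproducibility

# Randomly choose the position of the first name
start = sample(population.size, 1)

# Take every 50th position after the start, wrapping around to the
# beginning of the list once position 20,000 is passed
positions = (start - 1 + step * (0:(sample.size - 1))) %% population.size + 1

length(positions)
## [1] 400
head(positions)
```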


A type of sampling that is nonrandom is convenience sampling. 
Convenience sampling involves using results that are readily available. For 
example, a computer software store conducts a marketing study by 
interviewing potential customers who happen to be in the store browsing 
through the available software. The results of convenience sampling may be 
very good in some cases and highly biased (favors certain outcomes) in 
others. 


Sampling data should be done very carefully. Collecting data carelessly can 
have devastating results. Surveys mailed to households and then returned 
may be very biased (for example, they may favor a certain group). It is 
better for the person conducting the survey to select the sample 
respondents. 


In reality, simple random sampling should be done with replacement. That
is, once a member is picked, that member goes back into the population and
thus may be chosen more than once. This is true random sampling.
However, for practical reasons, in most populations, simple random
sampling is done without replacement. That is, a member of the
population may be chosen only once. Most samples are taken from large
populations, and the sample tends to be small in comparison to the
population. Since this is the case, sampling without replacement is
approximately the same as sampling with replacement because the chance
of picking the same individual more than once when sampling with replacement
is very low.


For example, in a college population of 10,000 people, suppose you want to 
pick a sample of 1000 for a survey. For any particular sample of 1000, if 
you are sampling with replacement, 


• the chance of picking the first person is 1000 out of 10,000 (0.1000);
• the chance of picking a different second person for this sample is 999
out of 10,000 (0.0999);
• the chance of picking the same person again is 1 out of 10,000 (very
low).


If you are sampling without replacement, 


• the chance of picking the first person for any particular sample is 1000
out of 10,000 (0.1000);
• the chance of picking a different second person is 999 out of 9,999
(0.0999);
• you do not replace the first person before picking the next person.


Compare the fractions 999/10,000 and 999/9,999. For accuracy, carry the
decimal answers to 4 decimal places. To 4 decimal places, these numbers
are equivalent (0.0999).


Sampling without replacement instead of sampling with replacement only
becomes a mathematical issue when the population is small, which is not that
common. For example, if the population is 25 people, the sample is 10, and
you are sampling with replacement, then for any particular sample,


• the chance of picking the first person is 10 out of 25 and a different
second person is 9 out of 25 (you replace the first person).


If you sample without replacement, 


• the chance of picking the first person is 10 out of 25 and then the
second person (which is different) is 9 out of 24 (you do not replace
the first person).


Compare the fractions 9/25 and 9/24. To 4 decimal places, 9/25 = 0.3600 
and 9/24 = 0.3750. To 4 decimal places, these numbers are not equivalent. 
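
Both comparisons can be checked in R with round(), a base R function that is not among the functions listed for this chapter. Note that R drops trailing zeros when printing, so 0.3600 appears as 0.36.

```r
# Large population: 10,000 people, sample of 1,000
round(999 / 10000, 4)   # with replacement
## [1] 0.0999
round(999 / 9999, 4)    # without replacement
## [1] 0.0999

# Small population: 25 people, sample of 10
round(9 / 25, 4)        # with replacement
## [1] 0.36
round(9 / 24, 4)        # without replacement
## [1] 0.375
```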


You can also use R to sample with replacement by adding replace =
TRUE to the sample() function. Imagine, for instance, that you want to
simulate flipping a coin 30 times. Since there is only one heads and one
tails in our population, we use replacement to get our sample. In the
example below, we are again using the set.seed() function so that you
can confirm that you are getting the same results.


# Simulating coin flipping
# (The output shown was produced with a version of R older than 3.6.0.)

coin.flips = c("H", "T")

set.seed(123)

sample(coin.flips, 30, replace = TRUE)

##  [1] "H" "T" "H" "T" "T" "H" "T" "T" "T" "H" "T"
## [12] "H" "T" "T" "H" "T" "H" "H" "H" "T" "T" "T"
## [23] "T" "T" "T" "T" "T" "T" "H" "H"


When you analyze data, it is important to be aware of sampling errors and 
nonsampling errors. The actual process of sampling causes sampling errors. 
For example, the sample may not be large enough or representative of the 
population. Factors not related to the sampling process cause nonsampling 
errors. A defective counting device can cause a nonsampling error. 


If we were to examine two samples representing the same population, they 
would, more than likely, not be the same. Just as there is variation in data, 
there is variation in samples. As you become accustomed to sampling, the 
variability will seem natural. 


Optional Collaborative Classroom Exercise 


Exercise: 


Problem: 


As a class, determine whether or not the following samples are 
representative. If they are not, discuss the reasons. 


1. To find the average GPA of all students in a university, use all 
honor students at the university as the sample. 

2. To find out the most popular cereal among young people under 
the age of 10, stand outside a large supermarket for three hours 
and speak to every 20th child under age 10 who enters the 
supermarket. 

3. To find the average annual income of all adults in the United 
States, sample U.S. congressmen. Create a cluster sample by 
considering each state as a stratum (group). By using simple 
random sampling, select states to be part of the cluster. Then 
survey every U.S. congressman in the cluster. 

4. To determine the proportion of people taking public transportation 
to work, survey 20 people in New York City. Conduct the survey 
by sitting in Central Park on a bench and interviewing every 
person who sits next to you. 

5. To determine the average cost of a two day stay in a hospital in 
Massachusetts, survey 100 hospitals across the state using simple 
random sampling. 


Variation and Critical Evaluation 

This module discusses statistical variability within data and samples. 
Students will be given the opportunity to see this variability in action 
through participation in an optional classroom exercise. This module also 
has a section that discusses Critical Evaluation. 


Variation in Data 


Variation is present in any set of data. For example, 16-ounce cans of
beverage may contain more or less than 16 ounces of liquid. In one study,
eight 16-ounce cans were measured and produced the following amounts (in
ounces) of beverage:


15.8, 16.1, 15.2, 14.8, 15.8, 15.9, 16.0, 15.5


Measurements of the amount of beverage in a 16-ounce can may vary 
because different people make the measurements or because the exact 
amount, 16 ounces of liquid, was not put into the cans. Manufacturers 
regularly run tests to determine if the amount of beverage in a 16-ounce can 
falls within the desired range. 


Be aware that as you take data, your data may vary somewhat from the data 
someone else is taking for the same purpose. This is completely natural. 
However, if two or more of you are taking the same data and get very 
different results, it is time for you and the others to reevaluate your data- 
taking methods and your accuracy. 


Variation in Samples 


It was mentioned previously that two or more samples from the same 
population and having the same characteristics as the population may be 
different from each other. Suppose Doreen and Jung both decide to study 
the average amount of time students sleep each night and use all students at 
their college as the population. Doreen uses systematic sampling and Jung 
uses cluster sampling. Doreen's sample will be different from Jung's sample 
even though both samples have the characteristics of the population. Even if
Doreen and Jung used the same sampling method, in all likelihood their
samples would be different. Neither would be wrong, however.


Think about what contributes to making Doreen's and Jung's samples 
different. 


If Doreen and Jung took larger samples (i.e. the number of data values is 
increased), their sample results (the average amount of time a student 
sleeps) would be closer to the actual population average. But still, their 
samples would be, in all likelihood, different from each other. This 
variability in samples cannot be stressed enough. 
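

A small simulation can make this concrete. The sketch below is only an illustration under stated assumptions: the population of sleep times is invented with rnorm(), and both Doreen and Jung are assumed to take simple random samples rather than the systematic and cluster samples described above.

```r
# Hypothetical population: nightly sleep (in hours) of 5000 students.
# These values are simulated, not real data.
set.seed(123)
population = rnorm(5000, mean = 7, sd = 1.5)

# Two small samples of the same size
doreen.small = sample(population, 25)
jung.small = sample(population, 25)

# Two larger samples
doreen.large = sample(population, 1000)
jung.large = sample(population, 1000)

# Compare the sample averages with the population average
mean(population)
c(mean(doreen.small), mean(jung.small))
c(mean(doreen.large), mean(jung.large))
```

The two averages in each pair will usually differ from each other, and the averages from the larger samples will tend to be closer to the population average.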


Size of a Sample 


The size of a sample (often called the number of observations) is important. 
The examples you have seen in this book so far have been small. Samples 
of only a few hundred observations, or even smaller, are sufficient for many 
purposes. In polling, samples that are from 1200 to 1500 observations are 
considered large enough and good enough if the survey is random and is 
well done. You will learn why when you study confidence intervals. 


Optional Collaborative Classroom Exercise 


Exercise: 


Problem: 


Divide into groups of two, three, or four. Your instructor will give each 
group one 6-sided die. Try this experiment twice. Roll one fair die (6- 
sided) 20 times. Record the number of ones, twos, threes, fours, fives, 
and sixes you get below ("frequency" is the number of times a 
particular face of the die occurs): 


Face on Die    Frequency
1
2
3
4
5
6


First Experiment (20 rolls)


Face on Die    Frequency
1
2
3
4
5
6


Second Experiment (20 rolls)


Did the two experiments have the same results? Probably not. If you 
did the experiment a third time, do you expect the results to be 
identical to the first or second experiment? (Answer yes or no.) Why 
or why not? 


Which experiment had the correct results? They both did. The job of 
the statistician is to see through the variability and draw appropriate 
conclusions. 


Running this experiment in R 


While it is much more interesting to experiment using a real die, we can 
also simulate it in R. To do this, we first create an object in R with the 
numbers 1 through 6 (representing the six faces on a die). After that, we can 
use the sample() function to take as many samples as we require, adding,
of course, replace = TRUE to use sampling with replacement. 


# Six faces on a die
die.face = 1:6
set.seed(123)
aa = sample(die.face, 20, replace = TRUE)

# Create a frequency table to analyze the
# results. The first row shows the number on the
# die face. The second row shows the frequency
# at which that number was rolled.
table(aa)
## aa
## 1 2 3 4 5 6
## 3 3 3 3 2 6

bb = sample(die.face, 20, replace = TRUE)
table(bb)
## bb
## 1 2 3 4 5 6
## 2 4 1 4 5 4


Notice the difference in the results of table aa and table bb. You can expect 
similar variation in your results when manually rolling the die. 


Critical Evaluation 


We need to critically evaluate the statistical studies we read about and 
analyze before accepting the results of the study. Common problems to be 
aware of include 


• Problems with Samples: A sample should be representative of the
population. A sample that is not representative of the population is
biased. Biased samples give results that are inaccurate and not valid.

• Self-Selected Samples: Responses only by people who choose to
respond, such as call-in surveys, are often unreliable.

• Sample Size Issues: Samples that are too small may be unreliable.
Larger samples are better if possible. In some situations, small samples
are unavoidable and can still be used to draw conclusions, even though
larger samples are better. Examples: Crash testing cars, medical testing
for rare conditions.

• Undue Influence: Collecting data or asking questions in a way that
influences the response.

• Non-response or refusal of subject to participate: The collected
responses may no longer be representative of the population. Often,
people with strong positive or negative opinions may answer surveys,
which can affect the results.

• Causality: A relationship between two variables does not mean that
one causes the other to occur. They may both be related (correlated)
because of their relationship through a different variable.

• Self-Funded or Self-Interest Studies: A study performed by a person or
organization in order to support their claim. Is the study impartial?
Read the study carefully to evaluate the work. Do not automatically
assume that the study is good but do not automatically assume the
study is bad either. Evaluate it on its merits and the work done.

• Misleading Use of Data: Improperly displayed graphs, incomplete
data, lack of context.

• Confounding: When the effects of multiple factors on a response
cannot be separated. Confounding makes it difficult or impossible to
draw valid conclusions about the effect of each factor.


Glossary 


Population 
The collection, or set, of all individuals, objects, or measurements 
whose properties are being studied. 


Sample 
A portion of the population under study. A sample is representative if it
characterizes the population being studied.


Answers and Rounding Off 
This module briefly explains the correct way to round off answers when 
working with statistical data. 


A simple way to round off answers is to carry your final answer one more 
decimal place than was present in the original data. Round only the final 
answer. Do not round any intermediate results, if possible. If it becomes 
necessary to round intermediate results, carry them to at least twice as many 
decimal places as the final answer. For example, the average of the three 
quiz scores 4, 6, 9 is 6.3, rounded to the nearest tenth, because the data are 
whole numbers. Most answers will be rounded in this manner. 
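

The quiz score example can be reproduced in R. This is a minimal sketch; the object names are our own.

```r
quiz.scores = c(4, 6, 9)

# Keep full precision in the intermediate result
quiz.mean = mean(quiz.scores)

# Round only the final answer. The data are whole numbers,
# so we report one decimal place.
quiz.rounded = round(quiz.mean, 1)
quiz.rounded
## [1] 6.3
```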


It is not necessary to reduce most fractions in this course. Especially in 
Probability Topics, the chapter on probability, it is more helpful to leave an 
answer as an unreduced fraction. 


Frequency, Relative Frequency, and Cumulative Frequency 

This module introduces the concepts of frequency, relative frequency, and 
cumulative relative frequency, and the relationship between these measures. 
Students will have the opportunity to interpret data through the sample problems 
provided. 


Twenty students were asked how many hours they worked per day. Their 
responses, in hours, are listed below, followed by a frequency table listing the 
different data values in ascending order and their frequencies. 


5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3


Data Value    Frequency
2             3
3             5
4             3
5             6
6             2
7             1


Frequency Table of Student Work Hours 


A frequency is the number of times a given datum occurs in a data set. 
According to the table above, there are three students who work 2 hours, five 
students who work 3 hours, etc. The total of the frequency column, 20, represents 
the total number of students included in the sample. A relative frequency is the 
fraction of times an answer occurs. To find the relative frequencies, divide each 
frequency by the total number of students in the sample - in this case, 20. 
Relative frequencies can be written as fractions, percents, or decimals. 


Cumulative relative frequency is the accumulation of the previous relative 
frequencies. To find the cumulative relative frequencies, add all the previous 
relative frequencies to the relative frequency for the current row. 


Data                  Relative         Cumulative Relative
Value    Frequency    Frequency        Frequency

2        3            3/20 or 0.15     0.15

3        5            5/20 or 0.25     0.15 + 0.25 = 0.40

4        3            3/20 or 0.15     0.40 + 0.15 = 0.55

5        6            6/20 or 0.30     0.55 + 0.30 = 0.85

6        2            2/20 or 0.10     0.85 + 0.10 = 0.95

7        1            1/20 or 0.05     0.95 + 0.05 = 1.00


Frequency Table of Student Work Hours w/ Relative and Cumulative Relative 
Frequency 


Note: 


• The sum of the relative frequency column is 20/20, or 1.

• The last entry of the cumulative relative frequency column is one,
indicating that one hundred percent of the data has been accumulated.

• Because of rounding, the relative frequency column may not always sum to
one and the last entry in the cumulative relative frequency column may not
be one. However, they each should be close to one.


Using R for Calculating Frequency, Cumulative Frequency, 
Relative Frequency, and Cumulative Relative Frequency 


As you can imagine, it is pretty straightforward to do something like this in R. 
The following functions apply: 


1. Frequencies: table().

2. Relative frequencies: table() divided by length() (which tells us how
many items are in an R object).

3. Cumulative frequencies: First we create intervals using cut() and count
them with table(), then we use cumsum(). Note: Set your cut() breaks to
range from one below your minimum value to your maximum value.

4. Cumulative relative frequencies: First, calculate your cumulative
frequencies (see item 3) and divide that by the total number of observations
(obtained using length()).


# Entering the data
hours.worked = c(5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6,
                 5, 4, 4, 3, 5, 2, 5, 3)

# A general frequency table
table(hours.worked)
## hours.worked
## 2 3 4 5 6 7
## 3 5 3 6 2 1

# Relative frequency table
table(hours.worked)/length(hours.worked)
## hours.worked
##    2    3    4    5    6    7
## 0.15 0.25 0.15 0.30 0.10 0.05

# To get cumulative frequencies, we need to put
# the hours into different intervals
x = table(cut(hours.worked, breaks = 1:7))

# Cumulative frequencies
cumsum(x)
## (1,2] (2,3] (3,4] (4,5] (5,6] (6,7]
##     3     8    11    17    19    20

# Cumulative relative frequencies
cumsum(x)/length(hours.worked)
## (1,2] (2,3] (3,4] (4,5] (5,6] (6,7]
##  0.15  0.40  0.55  0.85  0.95  1.00
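

When the data are whole numbers, as they are here, one possible shortcut is to skip cut() and apply cumsum() directly to the frequency table. The sketch below is an alternative we are suggesting, not the method used above; prop.table() is a base R function that converts counts to proportions, and the object names are our own.

```r
hours.worked = c(5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6,
                 5, 4, 4, 3, 5, 2, 5, 3)

freq = table(hours.worked)        # frequencies
rel.freq = prop.table(freq)       # relative frequencies
cum.freq = cumsum(freq)           # cumulative frequencies
cum.rel.freq = cumsum(rel.freq)   # cumulative relative frequencies

# Combine the four measures into a single table
data.frame(value = names(freq),
           freq = as.vector(freq),
           rel.freq = as.vector(rel.freq),
           cum.freq = as.vector(cum.freq),
           cum.rel.freq = as.vector(cum.rel.freq))
```

The rows of the result should match the frequency table of student work hours shown earlier.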


The following table represents the heights, in inches, of a sample of 100 male
semiprofessional soccer players.


Heights          Frequency      Relative          Cumulative Relative
(Inches)         of Players     Frequency         Frequency

59.95 - 61.95    5              5/100 = 0.05      0.05
61.95 - 63.95    3              3/100 = 0.03      0.05 + 0.03 = 0.08
63.95 - 65.95    15             15/100 = 0.15     0.08 + 0.15 = 0.23
65.95 - 67.95    40             40/100 = 0.40     0.23 + 0.40 = 0.63
67.95 - 69.95    17             17/100 = 0.17     0.63 + 0.17 = 0.80
69.95 - 71.95    12             12/100 = 0.12     0.80 + 0.12 = 0.92
71.95 - 73.95    7              7/100 = 0.07      0.92 + 0.07 = 0.99
73.95 - 75.95    1              1/100 = 0.01      0.99 + 0.01 = 1.00
                 Total = 100    Total = 1.00


Frequency Table of Soccer Player Height


The data in this table has been grouped into the following intervals: 


59.95 - 61.95 inches 
61.95 - 63.95 inches 
63.95 - 65.95 inches 
65.95 - 67.95 inches 
67.95 - 69.95 inches 
69.95 - 71.95 inches 
71.95 - 73.95 inches 
73.95 - 75.95 inches 


Note: This example is used again in the Descriptive Statistics chapter, where the 
method used to compute the intervals will be explained. 


In this sample, there are 5 players whose heights are between 59.95 - 61.95
inches, 3 players whose heights fall within the interval 61.95 - 63.95 inches, 15
players whose heights fall within the interval 63.95 - 65.95 inches, 40 players
whose heights fall within the interval 65.95 - 67.95 inches, 17 players whose
heights fall within the interval 67.95 - 69.95 inches, 12 players whose heights fall
within the interval 69.95 - 71.95, 7 players whose heights fall within the interval
71.95 - 73.95, and 1 player whose height falls within the interval 73.95 - 75.95.
All heights fall between the endpoints of an interval and not at the endpoints.
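

The relative and cumulative relative frequencies for the soccer players can be rebuilt in R from the eight frequencies alone. This is a minimal sketch; the object names are our own, and the interval labels are created with paste() only for display.

```r
# Class boundaries and frequencies for the soccer player heights
boundaries = seq(from = 59.95, to = 75.95, by = 2)
frequency = c(5, 3, 15, 40, 17, 12, 7, 1)

# Label each interval as "lower - upper"
interval = paste(head(boundaries, -1), "-", tail(boundaries, -1))

rel.freq = frequency / sum(frequency)
cum.rel.freq = cumsum(rel.freq)

data.frame(interval, frequency, rel.freq, cum.rel.freq)
```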


Example: 
Exercise: 


Problem: 


From the table, find the percentage of heights that are less than 65.95 
inches. 


Solution: 


If you look at the first, second, and third rows, the heights are all less than
65.95 inches. There are 5 + 3 + 15 = 23 males whose heights are less than
65.95 inches. The percentage of heights less than 65.95 inches is then 23/100
or 23%. This percentage is the cumulative relative frequency entry in the
third row.


Example: 
Exercise: 


Problem: 


From the table, find the percentage of heights that fall between 61.95 and 
65.95 inches. 


Solution: 


Add the relative frequencies in the second and third rows: 0.03 + 0.15 = 
0.18 or 18%. 
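

Both of these answers can be checked in R by summing the appropriate rows of the relative frequency column. This is a minimal sketch with object names of our own choosing.

```r
frequency = c(5, 3, 15, 40, 17, 12, 7, 1)
rel.freq = frequency / sum(frequency)

# Percentage of heights less than 65.95 inches (rows 1 to 3)
less.than = 100 * sum(rel.freq[1:3])
less.than

# Percentage of heights between 61.95 and 65.95 inches (rows 2 and 3)
between = 100 * sum(rel.freq[2:3])
between
```

The two results should be 23 and 18, in agreement with the solutions above.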


Example: 
Exercise: 


Problem: 


Use the table of heights of the 100 male semiprofessional soccer players. 
Fill in the blanks and check your answers. 


1. The percentage of heights that are from 67.95 to 71.95 inches is:

2. The percentage of heights that are from 67.95 to 73.95 inches is:

3. The percentage of heights that are more than 65.95 inches is:

4. The number of players in the sample who are between 61.95 and 71.95
inches tall is:

5. What kind of data are the heights?

6. Describe how you could gather this data (the heights) so that the data
are characteristic of all male semiprofessional soccer players.


Remember, you count frequencies. To find the relative frequency, divide 
the frequency by the total number of data values. To find the cumulative 
relative frequency, add all of the previous relative frequencies to the 
relative frequency for the current row. 


Solution: 


1. 29%

2. 36%

3. 77%

4. 87 

5. quantitative continuous 

6. get rosters from each team and choose a simple random sample from 
each 


Glossary 


Frequency 
The number of times a value of the data occurs. 


Relative Frequency 
The ratio of the number of times a value of the data occurs in the set of all 
outcomes to the number of all outcomes. 


Cumulative Relative Frequency 
The term applies to an ordered set of observations from smallest to largest. 
The Cumulative Relative Frequency is the sum of the relative frequencies 
for all values that are less than or equal to the given value. 


Practice: Sampling and Data 

This module provides an opportunity for students to practice concepts 
related to statistical sampling and data. Given a sample data set, the student 
will practice constructing frequency tables, differentiating between key 
terms, and comparing sampling techniques. 


Student Learning Outcomes 


• The student will practice constructing frequency tables.
• The student will compare sampling techniques.


Given 


Studies are often done by pharmaceutical companies to determine the 
effectiveness of a treatment program. Suppose that a new AIDS antibody 
drug is currently under study. It is given to patients once the AIDS 
symptoms have revealed themselves. Of interest is the average length of 
time in months patients live once starting the treatment. Two researchers 
each follow a different set of 40 AIDS patients from the start of treatment 
until their deaths. The following data (in months) are collected. 


Researcher 1: 3, 4, 11, 15, 16, 17, 22, 44, 37, 16, 14, 24, 25, 15, 26, 27, 33, 
29, 35, 44, 13, 21, 22, 10, 12, 8, 40, 32, 26, 27, 31, 34, 29, 17, 8, 24, 18, 47, 
33, 34 

Researcher 2: 3, 14, 11, 5, 16, 17, 28, 41, 31, 18, 14, 14, 26, 25, 21, 22, 31, 
2, 35, 44, 23, 21, 21, 16, 12, 18, 41, 22, 16, 25, 33, 34, 29, 13, 18, 24, 23, 
42, 33, 29 

Organize the Data 


Complete the tables below using the data provided. 


Survival Length                  Relative      Cumulative Rel.
(in months)        Frequency     Frequency     Frequency
0.5 - 6.5
6.5 - 12.5
12.5 - 18.5
18.5 - 24.5
24.5 - 30.5
30.5 - 36.5
36.5 - 42.5
42.5 - 48.5


Researcher 1


Survival Length                  Relative      Cumulative Rel.
(in months)        Frequency     Frequency     Frequency
0.5 - 6.5
6.5 - 12.5
12.5 - 18.5
18.5 - 24.5
24.5 - 30.5
30.5 - 36.5
36.5 - 42.5
42.5 - 48.5


Researcher 2
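

If you would like to check your completed tables with R, the sketch below shows one possible approach for Researcher 1, using the intervals from the table. The object names are our own; the same steps can be repeated with the data for Researcher 2.

```r
researcher.1 = c(3, 4, 11, 15, 16, 17, 22, 44, 37, 16, 14, 24, 25,
                 15, 26, 27, 33, 29, 35, 44, 13, 21, 22, 10, 12, 8,
                 40, 32, 26, 27, 31, 34, 29, 17, 8, 24, 18, 47, 33, 34)

# Interval boundaries: 0.5, 6.5, 12.5, ..., 48.5
survival.breaks = seq(from = 0.5, to = 48.5, by = 6)

# Frequency of each interval
freq = table(cut(researcher.1, breaks = survival.breaks))

# Relative and cumulative relative frequencies
rel.freq = freq / length(researcher.1)
cum.rel.freq = cumsum(rel.freq)

data.frame(interval = names(freq),
           frequency = as.vector(freq),
           rel.freq = as.vector(rel.freq),
           cum.rel.freq = as.vector(cum.rel.freq))
```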


Discussion Questions 
Discuss the following questions and then answer in complete sentences. 
Exercise: 
Problem: 
List two reasons why the data may differ. Would you expect the data to 
be identical? Why or why not? 
Exercise: 
Problem: 
Can you tell if one researcher is correct and the other one is incorrect? 
Why? 


Exercise: 


Problem: How could the researchers gather random data? 


Exercise: 


Problem: 


Suppose that the first researcher conducted his survey by randomly 
choosing one state in the nation and then randomly picking 40 patients 
from that state. What sampling method would that researcher have 
used? 


Exercise: 
Problem: 
Suppose that the second researcher conducted his survey by choosing
40 patients he knew. What sampling method would that researcher
have used? What concerns would you have about this data set, based
upon the data collection method?


Introduction 


Student Learning Objectives 
By the end of this chapter, the student should be able to: 


• Display data graphically and interpret graphs: stemplots, histograms
and boxplots.

• Recognize, describe, and calculate the measures of location of data:
quartiles and percentiles.

• Recognize, describe, and calculate the measures of the center of data:
mean, median, and mode.

• Recognize, describe, and calculate the measures of the spread of data:
variance, standard deviation, and range.

• Use R to generate basic graphs that help to interpret data.


The main R functions that will be used in this chapter include barplot(), 
boxplot(), hist(), seq(), and stem(). 


Introduction 


Once you have collected data, what will you do with it? Data can be 
described and presented in many different formats. For example, suppose 
you are interested in buying a house in a particular area. You may have no 
clue about the house prices, so you might ask your real estate agent to give 
you a sample data set of prices. Looking at all the prices in the sample often 
is overwhelming. A better way might be to look at the median price and the 
variation of prices. The median and variation are just two ways that you 
will learn to describe data. Your agent might also provide you with a graph 
of the data. 


In this chapter, you will study numerical and graphical ways to describe and 
display your data. This area of statistics is called "Descriptive Statistics". 
You will learn to calculate, and even more importantly, to interpret these 
measurements and graphs. 


Stem and Leaf Graphs and Bar Graphs 
This module introduces the use of stem-and-leaf graphs (stemplots), line 
graphs and bar graphs for describing a set of data visually. 


One simple graph, the stem-and-leaf graph or stemplot, comes from the
field of exploratory data analysis. It is a good choice when the data sets are
small. To create the plot, divide each observation of data into a stem and a
leaf. The leaf consists of one digit. For example, 23 has stem 2 and leaf 3.
Four hundred thirty-two (432) has stem 43 and leaf 2. Five thousand four
hundred thirty-two (5,432) has stem 543 and leaf 2. The decimal 9.3 has
stem 9 and leaf 3. Write the stems in a vertical line from smallest to
largest. Draw a vertical line to the right of the stems. Then write the leaves
in increasing order next to their corresponding stem.
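

For whole numbers, the split into stem and leaf can be sketched in R with integer division (%/%) and the remainder operator (%%). This is only an illustration of the idea, with object names of our own; the stem() function does the full job for you.

```r
values = c(23, 432, 5432)

# The stem is everything except the last digit
stems = values %/% 10

# The leaf is the last digit
leaves = values %% 10

data.frame(values, stems, leaves)

# For a decimal such as 9.3, the stem is the whole-number part
# and the leaf is the first digit after the decimal point
decimal.stem = floor(9.3)
decimal.leaf = round((9.3 - decimal.stem) * 10)
c(decimal.stem, decimal.leaf)
```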


Example: 

For Susan Dean's spring pre-calculus class, scores for the first exam were 
as follows (smallest to largest): 

33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73, 74, 78, 80, 83,
88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100


Stem    Leaf
3       3
4       299
5       355
6       1378899
7       2348
8       03888
9       0244446
10      0


Stem-and-Leaf Diagram 


The stemplot shows that most scores fell in the 60s, 70s, 80s, and 90s.
Eight out of the 31 scores, or approximately 26% of the scores, were in the
90s or 100, a fairly high number of As.
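

The count and percentage quoted here can be verified in R. This is a minimal sketch; it relies on the fact that mean() applied to a logical vector gives the proportion of TRUE values.

```r
exam.scores = c(33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68,
                69, 69, 72, 73, 74, 78, 80, 83, 88, 88, 88, 90,
                92, 94, 94, 94, 94, 96, 100)

# Total number of scores
n.scores = length(exam.scores)

# Number of scores in the 90s or equal to 100
n.high = sum(exam.scores >= 90)

# Percentage of such scores, rounded to a whole number
pct.high = round(100 * mean(exam.scores >= 90))

c(n.scores, n.high, pct.high)
## [1] 31  8 26
```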


The stemplot is a quick way to graph and gives an exact picture of the data. 
You want to look for an overall pattern and any outliers. An outlier is an 
observation of data that does not fit the rest of the data. It is sometimes 
called an extreme value. When you graph an outlier, it will appear not to fit 
the pattern of the graph. Some outliers are due to mistakes (for example, 
writing down 50 instead of 500) while others may indicate that something 
unusual is happening. It takes some background information to explain 
outliers. In the example above, there were no outliers. 


Example: 

Create a stem plot using the data: 

1.1, 1.5, 2.3, 2.5, 2.7, 3.2, 3.3, 3.3, 3.5, 3.8, 4.0, 4.2, 4.5, 4.5, 4.7, 4.8,
5.5, 5.6, 6.5, 6.7, 12.3

The data are the distance (in kilometers) from a home to the nearest 
supermarket. 

Exercise: 


Problem: 


1. Are there any outliers? 
2. Do the data seem to have any concentration of values? 


Note: The leaves are to the right of the decimal.


Solution: 


The value 12.3 may be an outlier. Values appear to concentrate at 3
and 4 kilometers.


Stem    Leaf
1       15
2       357
3       23358
4       025578
5       56
6       57
7
8
9
10
11
12      3


Stem and Leaf Plots in R 


The stem() function in R is used to create stem and leaf plots. For most
purposes, you do not need to pass any other arguments; however, you
may occasionally need to use the scale argument to get a more usable
stem and leaf plot. Notice that R tells you where the decimal is in its output
so that you can easily figure out the original values.


exam.scores = c(33, 42, 49, 49, 53, 55, 55, 61, 63,
                67, 68, 68, 69, 69, 72, 73, 74, 78, 80, 83, 88,
                88, 88, 90, 92, 94, 94, 94, 94, 96, 100)
stem(exam.scores)
##
##   The decimal point is 1 digit(s) to the right of the |
##
##    3 | 3
##    4 | 299
##    5 | 355
##    6 | 1378899
##    7 | 2348
##    8 | 03888
##    9 | 0244446
##   10 | 0


Bar graphs consist of bars that are separated from each other. The bars can
be rectangles or they can be rectangular boxes and they can be vertical or
horizontal. The bar graph shown in Example 4 uses the data of Example 3.
Frequencies are represented by the heights of the bars.


Example: 

In a survey, 40 mothers were asked how many times per week a teenager 
must be reminded to do his/her chores. The results are shown in the table 
and the bar graph. 


Number of times teenager is reminded    Frequency
0                                       2
1                                       5
2                                       8
3                                       14
4                                       7
5                                       4


[Figure: bar graph with the number of times a teenager is reminded (0 to 5) on the x-axis and frequency on the y-axis.]

Bar Graphs in R 


To create a bar graph in R, you need to use the barplot( ) function. 
There are several useful arguments for this function including: 


• names.arg: An object representing the values to be placed below
each bar (usually the x-axis).

• xlab: The title for your x-axis.

• ylab: The title for your y-axis.

• main: The title of your graph.


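# Number of reminders per week and how many mothers gave each answer
reminders = c(0, 1, 2, 3, 4, 5)
frequency = c(2, 5, 8, 14, 7, 4)

# R is case sensitive, so the argument is xlab, not Xlab
barplot(frequency, names.arg = reminders,
        xlab = "Number of reminders",
        ylab = "Frequency of responses",
        main = "Number of times each week\nthat a teenager is reminded to do chores")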


[Figure: bar graph titled "Number of times each week that a teenager is reminded to do chores", with the number of reminders on the x-axis and the frequency of responses on the y-axis.]


The bar graph shown in Example 5 has age groups represented on the x- 
axis and proportions on the y-axis. 


Example: 

By the end of March 2009, in the United States Facebook had over 56 
million users. The table shows the age groups, the number of users in each 
age group and the proportion (%) of users in each age group. Source: 
http://www.insidefacebook.com/2009/03/25/number-of-us-facebook- 
users-over-35-nearly-doubles-in-last-60-days/ 


Age        Number of         Proportion (%) of
groups     Facebook users    Facebook users
13 - 25    25,510,040        46%
26 - 44    23,123,900        41%
45 - 65    7,431,020         13%


[Figure: bar graph with age groups on the x-axis and proportion (%) of Facebook users on the y-axis, scaled from 0 to 50.]


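The proportions in the table, and a bar graph like the one described, can be reproduced in R from the user counts. This is a minimal sketch; the object names and axis titles are our own choices.

```r
age.groups = c("13-25", "26-44", "45-65")
users = c(25510040, 23123900, 7431020)

# Proportion (%) of users in each age group, rounded to whole percents
percent = round(100 * users / sum(users))
percent
## [1] 46 41 13

barplot(percent, names.arg = age.groups,
        xlab = "Age groups",
        ylab = "Proportion (%) of Facebook users",
        ylim = c(0, 50))
```

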
Glossary 
Outlier 


An observation that does not fit the rest of the data. 


Histograms 

This module provides an overview of Descriptive Statistics: Histogram as a 
part of Collaborative Statistics collection (col10522) by Barbara Illowsky 
and Susan Dean. 


For most of the work you do in this book, you will use a histogram to 
display the data. One advantage of a histogram is that it can readily display 
large data sets. A rule of thumb is to use a histogram when the data set 
consists of 100 values or more. 


A histogram consists of contiguous boxes. It has both a horizontal axis and 
a vertical axis. The horizontal axis is labeled with what the data represents 
(for instance, distance from your home to school). The vertical axis is 
labeled either "frequency" or "relative frequency". The graph will have the 
same shape with either label. Frequency is commonly used when the data 
set is small and relative frequency is used when the data set is large or 
when we want to compare several distributions. The histogram (like the 
stemplot) can give you the shape of the data, the center, and the spread of 
the data. 


The relative frequency is equal to the frequency for an observed value of 
the data divided by the total number of data values in the sample. (In the 
chapter on Sampling and Data, we defined frequency as the number of 
times an answer occurs.) If: 


• f = frequency

• n = total number of data values (or the sum of the individual
frequencies), and

• RF = relative frequency,


then:
Equation:


$$RF = \frac{f}{n}$$


For example, if 3 students in Mr. Ahab's English class of 40 students
received an A, then,


$f = 3$, $n = 40$, and $RF = \frac{f}{n} = \frac{3}{40} = 0.075$


Seven and a half percent of the students received an A. 
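

The same calculation in R, as a minimal sketch using the symbols defined above as object names:

```r
f = 3    # number of students who received an A
n = 40   # total number of students in the class

# Relative frequency
RF = f / n
RF
## [1] 0.075

# Expressed as a percentage
100 * RF
## [1] 7.5
```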


To construct a histogram, first decide how many bars or intervals, also 
called classes, represent the data. Many histograms consist of from 5 to 15 
bars or classes for clarity. Choose a starting point for the first interval to be 
less than the smallest data value. A convenient starting point is a lower 
value carried out to one more decimal place than the value with the most 
decimal places. For example, if the value with the most decimal places is 
6.1 and this is the smallest value, a convenient starting point is 6.05 (6.1 - 
0.05 = 6.05). We say that 6.05 has more precision. If the value with the 
most decimal places is 2.23 and the lowest value is 1.5, a convenient 
starting point is 1.495 (1.5 - 0.005 = 1.495). If the value with the most
decimal places is 3.234 and the lowest value is 1.0, a convenient starting
point is 0.9995 (1.0 - 0.0005 = 0.9995). If all the data happen to be integers
and the smallest value is 2, then a convenient starting point is 1.5 (2 - 0.5 = 
1.5). Also, when the starting point and other boundaries are carried to one 
additional decimal place, no data value will fall on a boundary. 
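

The rule for choosing a starting point can be written as a small R function. The function below is a hypothetical helper of our own, not a built-in R function; its second argument is the largest number of decimal places found in the data.

```r
# Hypothetical helper: subtract half a unit of the next decimal place
# from the smallest data value
starting.point = function(smallest, decimals) {
  smallest - 0.5 * 10^(-decimals)
}

starting.point(6.1, 1)   # most precise value has 1 decimal place
starting.point(1.5, 2)   # most precise value has 2 decimal places
starting.point(1.0, 3)   # most precise value has 3 decimal places
starting.point(2, 0)     # all data are integers
```

These four calls correspond to the four cases described above and should return 6.05, 1.495, 0.9995, and 1.5.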


Example: 

The following data are the heights (in inches to the nearest half inch) of 
100 male semiprofessional soccer players. The heights are continuous data 
since height is measured. 

60, 60.5, 61, 61, 61.5, 63.5, 63.5, 63.5, 64, 64, 64, 64, 64, 64, 64, 64.5,
64.5, 64.5, 64.5, 64.5, 64.5, 64.5, 64.5, 66, 66, 66, 66, 66, 66, 66, 66, 66,
66, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 66.5, 67, 67,
67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5,
67.5, 68, 68, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69, 69.5, 69.5, 69.5, 69.5,
69.5, 70, 70, 70, 70, 70, 70, 70.5, 70.5, 70.5, 71, 71, 71, 72, 72, 72, 72.5,
72.5, 73, 73.5, 74


The smallest data value is 60. Since the data with the most decimal places 
has one decimal (for instance, 61.5), we want our starting point to have two 
decimal places. Since the numbers 0.5, 0.05, 0.005, etc. are convenient 
numbers, use 0.05 and subtract it from 60, the smallest value, for the 
convenient starting point. 

60 - 0.05 = 59.95, which is more precise than, say, 61.5 by one decimal
place. The starting point is, then, 59.95. The largest value is 74, so
74 + 0.05 = 74.05 is the ending value.

Next, calculate the width of each bar or class interval. To calculate this 
width, subtract the starting point from the ending value and divide by the 
number of bars (you must choose the number of bars you desire). Suppose 
you choose 8 bars. 

Equation: 


$$\frac{74.05 - 59.95}{8} \approx 1.76$$

Note: We will round up to 2 and make each bar or class interval 2 units 
wide. Rounding up to 2 is one way to prevent a value from falling on a 
boundary. For this example, using 1.76 as the width would also work. 


The boundaries are: 


59.95
59.95 + 2 = 61.95
61.95 + 2 = 63.95
63.95 + 2 = 65.95
65.95 + 2 = 67.95
67.95 + 2 = 69.95
69.95 + 2 = 71.95
71.95 + 2 = 73.95
73.95 + 2 = 75.95
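

The width calculation and the list of boundaries can be sketched in R as follows. The object names are our own, and ceiling() is used here to round the width up to the next whole number, as was done in this example.

```r
starting.point = 59.95
ending.value = 74.05
number.of.bars = 8

# Width before rounding (about 1.76)
raw.width = (ending.value - starting.point) / number.of.bars
raw.width

# Round up to a width of 2
bar.width = ceiling(raw.width)

# Nine boundaries are needed to mark off eight bars
boundaries = seq(from = starting.point, by = bar.width,
                 length.out = number.of.bars + 1)
boundaries
```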


The heights 60 through 61.5 inches are in the interval 59.95 - 61.95. The 
heights that are 63.5 are in the interval 61.95 - 63.95. The heights that are 
64 through 64.5 are in the interval 63.95 - 65.95. The heights 66 through 
67.5 are in the interval 65.95 - 67.95. The heights 68 through 69.5 are in 
the interval 67.95 - 69.95. The heights 70 through 71 are in the interval 
69.95 - 71.95. The heights 72 through 73.5 are in the interval 71.95 - 
73.95. The height 74 is in the interval 73.95 - 75.95. 

The following histogram displays the heights on the x-axis and relative 
frequency on the y-axis. 


[Figure: histogram of the player heights. x-axis: Heights, with class 
boundaries from 59.95 to 75.95; y-axis: Relative Frequency] 


Creating a Histogram in R 


Because histograms are used so often in statistical analysis, R can 
generate them quite easily. The functions you will use are seq() to 
generate the required intervals and hist() for generating the histogram. 
Additionally, you may use the following arguments with the hist() 
function: 


• breaks: Used to tell R how many breaks the histogram should have 
or where the intervals should be. 

• xlab: This will add a label to your x-axis. 

• ylab: This will add a label to your y-axis. 

• main: Used to add a chart title. 


player.height = c(60, 60.5, 61, 61, 61.5, 63.5, 
63.5, 63.5, 64, 64, 64, 64, 64, 
64, 64, 64.5, 64.5, 64.5, 64.5, 
64.5, 64.5, 64.5, 64.5, 66, 66, 
66, 66, 66, 66, 66, 66, 66, 66, 
66.5, 66.5, 66.5, 66.5, 66.5, 
66.5, 66.5, 66.5, 66.5, 66.5, 
66.5, 67, 67, 67, 67, 67, 67, 
67, 67, 67, 67, 67, 67, 67.5, 
67.5, 67.5, 67.5, 67.5, 67.5, 
67.5, 68, 68, 69, 69, 69, 69, 
69, 69, 69, 69, 69, 69, 69.5, 
69.5, 69.5, 69.5, 69.5, 70, 70, 
70, 70, 70, 70, 70.5, 70.5, 
70.5, 71, 71, 71, 72, 72, 72, 
72.5, 72.5, 73, 73.5, 74) 
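# Nine break points, 2 units apart, starting at 59.95 
hist.breaks = seq(from = 59.95, by = 2, 
length.out = 9) 
# Argument names are case-sensitive: use xlab, not Xlab 
hist(player.height, breaks = hist.breaks, 
xlab = "Player Heights", ylab = "Frequency", 
main = "Heights (in inches)\nof 100 Soccer Players") 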


[Figure: histogram titled "Heights (in inches) of 100 Soccer Players". 
x-axis: Player Heights; y-axis: Frequency] 


Note that the above histogram is of frequencies, not relative frequencies. 
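
If relative frequencies are wanted instead, one possible approach is to let hist() compute the counts without plotting, divide by the total, and then plot the modified object. This is a sketch rather than the only way to do it; the names heights, h, and rel.freq are illustrative.

```r
# Heights (in inches) of the 100 players, rebuilt from the listing
heights <- c(60, 60.5, 61, 61, 61.5, rep(63.5, 3), rep(64, 7),
             rep(64.5, 8), rep(66, 10), rep(66.5, 11), rep(67, 12),
             rep(67.5, 7), rep(68, 2), rep(69, 10), rep(69.5, 5),
             rep(70, 6), rep(70.5, 3), rep(71, 3), rep(72, 3),
             rep(72.5, 2), 73, 73.5, 74)
hist.breaks <- seq(from = 59.95, by = 2, length.out = 9)

# Compute the bins without drawing anything
h <- hist(heights, breaks = hist.breaks, plot = FALSE)

# Relative frequency = count in the interval / total number of values
rel.freq <- h$counts / sum(h$counts)
rel.freq

# Replace the counts with relative frequencies and draw the bars
h$counts <- rel.freq
plot(h, xlab = "Player Heights", ylab = "Relative Frequency",
     main = "Heights (in inches)\nof 100 Soccer Players")
```

The tallest bar should reach 0.40, since 40 of the 100 heights lie between 65.95 and 67.95.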


Example: 

The following data are the number of books bought by 50 part-time college 
students at ABC College. The number of books is discrete data since books 
are counted. 

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2 
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3 
4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6 

Eleven students buy 1 book. Ten students buy 2 books. Sixteen students 
buy 3 books. Six students buy 4 books. Five students buy 5 books. Two 
students buy 6 books. 

Because the data are integers, subtract 0.5 from 1, the smallest data value 
and add 0.5 to 6, the largest data value. Then the starting point is 0.5 and 
the ending value is 6.5. 

Exercise: 


Problem: 


Calculate the width of each bar or class interval. If the data are 
discrete and there are not too many different values, a width that 
places the data values in the middle of the bar or class interval is the 
most convenient. Since the data consist of the numbers 1, 2, 3, 4, 5, 6 
and the starting point is 0.5, a width of one places the 1 in the middle 
of the interval from 0.5 to 1.5, the 2 in the middle of the interval from 
1.5 to 2.5, the 3 in the middle of the interval from 2.5 to 3.5, the 4 in 
the middle of the interval from _______ to _______, the 5 in the 
middle of the interval from _______ to _______, and the _______ in 
the middle of the interval from _______ to _______. 


Solution: 


• 3.5 to 4.5 
• 4.5 to 5.5 
• 6 
• 5.5 to 6.5 


Calculate the number of bars as follows: 
Equation: 
$$\frac{6.5 - 0.5}{\text{bars}} = 1$$
where 1 is the width of a bar. Therefore, bars = 6. 
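
The same histogram can be sketched in R. The data vector below is rebuilt from the counts stated in the example (eleven 1s, ten 2s, and so on); the variable names and the chart title are illustrative assumptions.

```r
# Number of books bought by the 50 students, rebuilt from the counts
books <- rep(1:6, times = c(11, 10, 16, 6, 5, 2))

# Breaks at 0.5, 1.5, ..., 6.5 put each whole number mid-interval
book.breaks <- seq(from = 0.5, to = 6.5, by = 1)

h <- hist(books, breaks = book.breaks,
          xlab = "Number of Books", ylab = "Frequency",
          main = "Books Bought by 50 Part-Time Students")

# Counts per bar, in the order 1, 2, 3, 4, 5, 6
h$counts
```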


The following histogram displays the number of books on the x-axis and 
the frequency on the y-axis. 

[Figure: histogram. x-axis: Number of Books; y-axis: Frequency] 
Glossary 
Frequency 


The number of times a value of the data occurs. 


Relative Frequency 
The ratio of the number of times a value of the data occurs in the set of 
all outcomes to the number of all outcomes. 


Box Plots 
Boxplots are useful for getting a sense of the distribution of a variable. 


Box plots or box-whisker plots give a good graphical image of the 
concentration of the data. They also show how far from most of the data the 
extreme values are. The box plot is constructed from five values: the 
smallest value, the first quartile, the median, the third quartile, and the 
largest value. The median, the first quartile, and the third quartile will be 
discussed here, and then again in the section on measuring data in this 
chapter. We use these values to compare how close other data values are to 
them. 


The median, a number, is a way of measuring the "center" of the data. You 
can think of the median as the "middle value," although it does not actually 
have to be one of the observed values. It is a number that separates ordered 
data into halves. Half the values are the same number or smaller than the 
median and half the values are the same number or larger. For example, 
consider the following data: 


1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1 
Ordered from smallest to largest: 
1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5 


The median is between the 7th value, 6.8, and the 8th value 7.2. To find the 
median, add the two values together and divide by 2. 
Equation: 


$$\frac{6.8 + 7.2}{2} = 7$$


The median is 7. Half of the values are smaller than 7 and half of the values 
are larger than 7. 


Quartiles are numbers that separate the data into quarters. Quartiles may or 
may not be part of the data. To find the quartiles, first find the median or 
second quartile. The first quartile is the middle value of the lower half of 
the data and the third quartile is the middle value of the upper half of the 
data. To get the idea, consider the same data set shown above. 


The median or second quartile is 7. The lower half of the data is 1, 1, 2, 2, 
4, 6, 6.8. The middle value of the lower half is 2. 


1, 1, 2, 2, 4, 6, 6.8 


The number 2, which is part of the data, is the first quartile. One-fourth of 
the values are the same or less than 2 and three-fourths of the values are 
more than 2. 


The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of 
the upper half is 9. 


7.2, 8, 8.3, 9, 10, 10, 11.5 


The number 9, which is part of the data, is the third quartile. Three-fourths 
of the values are less than 9 and one-fourth of the values are more than 9. 


To construct a box plot, use a horizontal number line and a rectangular box. 
The smallest and largest data values label the endpoints of the axis. The first 
quartile marks one end of the box and the third quartile marks the other end 
of the box. The middle fifty percent of the data fall inside the box. The 
"whiskers" extend from the ends of the box to the smallest and largest data 
values. The box plot gives a good quick picture of the data. 


Using the data from the start of this section, recall that the first quartile is 2, 
the median is 7, and the third quartile is 9. The smallest value is 1 and the 
largest value is 11.5. The box plot is constructed as follows: 


[Figure: box plot on a number line from 1 to 11.5, with the box running 
from 2 to 9 and the median marked at 7] 


The two whiskers extend from the first quartile to the smallest value and 
from the third quartile to the largest value. The median is shown with a 
dashed line. 


Making a Box Plot in R 


As with histograms, box plots are very easy to make in R. They are made 
using the boxplot() function. By default, R orients the box plot 
vertically; to change this, simply add the argument horizontal = 
TRUE to the function. For most simple plots, no additional arguments are 
required. You may also want to change the size of the R graph window to 
view the box plot in a more aesthetic scale. 


Here is how you would create the box plot in R using the same data from 
the start of this section. 


a = c(1, 11.5, 6, 7.2, 4, 8, 9, 
10, 6.8, 8.3, 2, 2, 10, 1) 
boxplot(a, horizontal = TRUE) 
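
To see the five values behind the plot, base R's fivenum() function returns the minimum, lower hinge, median, upper hinge, and maximum. For this data set the hinges coincide with the quartiles found by hand earlier in the section; that will not necessarily hold for every data set. The labels attached below are illustrative.

```r
a <- c(1, 11.5, 6, 7.2, 4, 8, 9,
       10, 6.8, 8.3, 2, 2, 10, 1)

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(a)
names(five) <- c("min", "Q1", "median", "Q3", "max")
five
```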


Example: 


The following data are the heights of 40 students in a statistics class. 

59, 60, 61, 62, 62, 63, 63, 64, 64, 64, 65, 65, 65, 65, 65, 65, 65, 65, 65, 66, 
66, 67, 67, 68, 68, 69, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 77 
Construct a box plot with the following properties: 


Smallest value = 59 

Largest value = 77 

Q1: First quartile = 64.5 

Q2: Second quartile or median= 66 
Q3: Third quartile = 70 


[Figure: box plot with smallest value 59, Q1 = 64.5, median 66, Q3 = 70, 
and largest value 77] 


• a. Each quarter has 25% of the data. 

• b. The spreads of the four quarters are 64.5 - 59 = 5.5 (first quarter), 66 
- 64.5 = 1.5 (second quarter), 70 - 66 = 4 (third quarter), and 77 - 70 = 
7 (fourth quarter). So, the second quarter has the smallest spread and 
the fourth quarter has the largest spread. 

• c. Interquartile Range: IQR = Q3 - Q1 = 70 - 64.5 = 5.5. 

• d. The interval 59 through 65 has more than 25% of the data, so it has 
more data in it than the interval 66 through 70, which has 25% of the 
data. 
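
The five values and the spreads in this example can be reproduced in R. This is a sketch with illustrative variable names; fivenum() is used because its hinges match the hand-computed quartiles for this data set.

```r
# Heights of the 40 students, as listed in the example
student.heights <- c(59, 60, 61, 62, 62, 63, 63, 64, 64, 64, rep(65, 9),
                     66, 66, 67, 67, 68, 68, 69, rep(70, 5),
                     71, 71, 72, 72, 73, 74, 74, 75, 77)

# Smallest value, Q1, median, Q3, largest value
five <- fivenum(student.heights)
five

# Spread of each quarter: differences between consecutive values
quarter.spreads <- diff(five)
quarter.spreads

# Interquartile range: Q3 - Q1
iqr <- five[4] - five[2]
iqr

boxplot(student.heights, horizontal = TRUE)
```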


For some sets of data, some of the largest value, smallest value, first 
quartile, median, and third quartile may be the same. For instance, you 
might have a data set in which the median and the third quartile are the 
same. In this case, the diagram would not have a dotted line inside the box 
displaying the median. The right side of the box would display both the 
third quartile and the median. For example, if the smallest value and the 
first quartile were both 1, the median and the third quartile were both 5, 
and the largest value was 7, the box plot would look as follows: 
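
As an illustration, the small data set below was invented for this sketch (it does not come from the original text) so that the smallest value equals the first quartile and the median equals the third quartile.

```r
# Hypothetical data, made up for illustration only
x <- c(1, 1, 1, 5, 5, 5, 7)

# Smallest value, Q1, median, Q3, largest value
five <- fivenum(x)
five

# The median line coincides with the right edge of the box, and
# there is no left whisker because the minimum equals Q1
boxplot(x, horizontal = TRUE)
```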


Example: 

Test scores for a college statistics class held during the day are: 

99, 56, 78, 55.5, 32, 90, 80, 81, 56, 59, 45, 77, 84.5, 84, 70, 72, 68, 32, 79, 
90 

Test scores for a college statistics class held during the evening are: 
98, 78, 68, 83, 81, 89, 88, 76, 65, 45, 98, 90, 80, 84.5, 85, 79, 78, 98, 90, 
79, 81, 25.5 

Exercise: 


Problem: 


• What are the smallest and largest data values for each data set? 

• What are the median, the first quartile, and the third quartile for 
each data set? 

• Create a boxplot for each set of data. 

• Which boxplot has the widest spread for the middle 50% of the 
data (the data between the first and third quartiles)? What does 
this mean for that set of data in comparison to the other set of 
data? 

• For each data set, what percent of the data is between the 
smallest value and the first quartile? (Answer: 25%) the first 
quartile and the median? (Answer: 25%) the median and the third 
quartile? the third quartile and the largest value? What percent of 
the data is between the first quartile and the largest value? 
(Answer: 75%) 


Solution: 


First Data Set 


• Xmin = 32 
• Q1 = 56 
• M = 74.5 
• Q3 = 82.5 
• Xmax = 99 


Second Data Set 


• Xmin = 25.5 
• Q1 = 78 
• M = 81 
• Q3 = 89 
• Xmax = 98 


[Figure: box plots of the two data sets, with the first data set on top] 

The first data set (the top box plot) has the widest spread for the middle 
50% of the data. IQR = Q3 — Q1 is 82.5 — 56 = 26.5 for the first data 
set and 89 — 78 = 11 for the second data set. So, the first set of data has 
its middle 50% of scores more spread out. 

25% of the data is between M and Q3 and 25% is between Q3 and Xmax. 
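
A sketch of how this solution could be checked in R follows. The variable names are illustrative. The argument range = 0 asks boxplot() to extend the whiskers to the smallest and largest values, as in the hand-drawn plots; by default R would instead draw points that lie far from the box separately.

```r
day <- c(99, 56, 78, 55.5, 32, 90, 80, 81, 56, 59, 45, 77, 84.5, 84,
         70, 72, 68, 32, 79, 90)
evening <- c(98, 78, 68, 83, 81, 89, 88, 76, 65, 45, 98, 90, 80, 84.5,
             85, 79, 78, 98, 90, 79, 81, 25.5)

# Xmin, Q1, M, Q3, Xmax for each class
day.five <- fivenum(day)
evening.five <- fivenum(evening)
day.five
evening.five

# IQR = Q3 - Q1
day.iqr <- day.five[4] - day.five[2]
evening.iqr <- evening.five[4] - evening.five[2]
c(day = day.iqr, evening = evening.iqr)

# Side-by-side box plots with whiskers running to the extremes
boxplot(day, evening, names = c("Day", "Evening"),
        horizontal = TRUE, range = 0)
```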


Glossary 


Median 


A number that separates ordered data into halves: half the values are 
the same number or smaller than the median and half the values are the 
same number or larger than the median. The median may or may not 
be part of the data. 


Quartiles 
The numbers that separate the data into quarters. Quartiles may or may 
not be part of the data. The second quartile is the median of the data. 


Measuring the Location of the Data 

Descriptive Statistics: Measuring the Location of Data explains percentiles and 
quartiles and is part of the collection col10555 written by Barbara Illowsky and 
Susan Dean. Roberta Bloom contributed the section "Interpreting Percentiles, 
Quartile and the Median." 


The common measures of location are quartiles and percentiles (%iles). Quartiles 
are special percentiles. The first quartile, Q1, is the same as the 25th percentile 
(25th %ile) and the third quartile, Q3, is the same as the 75th percentile (75th 
%ile). The median, M, is called both the second quartile and the 50th percentile 
(50th %ile). 


To calculate quartiles and percentiles, the data must be ordered from smallest to 
largest. Recall that quartiles divide ordered data into quarters. Percentiles divide 
ordered data into hundredths. To score in the 90th percentile of an exam does not 
mean, necessarily, that you received 90% on a test. It means that 90% of test 
scores are the same or less than your score and 10% of the test scores are the same 
or greater than your test score. 


Percentiles are useful for comparing values. For this reason, universities and 
colleges use percentiles extensively. 


Percentiles are mostly used with very large populations. Therefore, if you were to 
say that 90% of the test scores are less (and not the same or less) than your score, 
it would be acceptable because removing one particular data value is not 
significant. 


The interquartile range is a number that indicates the spread of the middle half 
or the middle 50% of the data. It is the difference between the third quartile (Q3) 
and the first quartile (Q1). 

Equation: 


$$IQR = Q_3 - Q_1$$


The IQR can help to determine potential outliers. A value is suspected to be a 
potential outlier if it is more than (1.5)(IQR) below the first quartile or more 
than (1.5)(IQR) above the third quartile. Potential outliers always need further 
investigation. 


Example: 
Exercise: 


Problem: 

For the following 13 real estate prices, calculate the IQR and determine if 
any prices are outliers. Prices are in dollars. (Source: San Jose Mercury 
News) 


389950, 230500, 158000, 479000, 639000, 114950, 5500000, 387000, 
659000, 529000, 575000, 488800, 1095000 


Solution: 
Order the data from smallest to largest. 


114950, 158000, 230500, 387000, 389950, 479000, 488800, 529000, 
575000, 639000, 659000, 1095000, 5500000 


M = 488800 

$$Q_1 = \frac{230500 + 387000}{2} = 308750$$

$$Q_3 = \frac{639000 + 659000}{2} = 649000$$

IQR = 649000 - 308750 = 340250 

(1.5)(IQR) = (1.5)(340250) = 510375 

Q1 - (1.5)(IQR) = 308750 - 510375 = -201625 
Q3 + (1.5)(IQR) = 649000 + 510375 = 1159375 


No house price is less than -201625. However, 5,500,000 is more than 
1,159,375. Therefore, 5,500,000 is a potential outlier. 
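
The outlier check above can be sketched in R. The variable names are illustrative. The quartiles are requested with type = 6 because that method reproduces the hand-computed values for this data set (R's default method gives different quartiles).

```r
real.estate <- c(389950, 230500, 158000, 479000, 639000,
                 114950, 5500000, 387000, 659000, 529000, 575000,
                 488800, 1095000)

# First and third quartiles, using the method that matches the hand work
q <- unname(quantile(real.estate, probs = c(0.25, 0.75), type = 6))
iqr <- q[2] - q[1]

# Values beyond these fences are potential outliers
lower.fence <- q[1] - 1.5 * iqr
upper.fence <- q[2] + 1.5 * iqr
c(lower = lower.fence, upper = upper.fence)

potential.outliers <- real.estate[real.estate < lower.fence |
                                  real.estate > upper.fence]
potential.outliers
```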


Example: 
Exercise: 


Problem: 
For the two data sets in the test scores example, find the following: 


• a. The interquartile range. Compare the two interquartile ranges. 

• b. Any outliers in either set. 

• c. The 30th percentile and the 80th percentile for each set. How much 
data falls below the 30th percentile? Above the 80th percentile? 


Solution: 


For the IQRs, see the answer to the test scores example. The first data set has 
the larger IQR, so the scores between Q3 and Q1 (middle 50%) for the first 
data set are more spread out and not clustered about the median. 


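First Data Set 
• $\frac{3}{2} \cdot IQR = \frac{3}{2} \cdot 26.5 = 39.75$ 
• Xmax - Q3 = 99 - 82.5 = 16.5 
• Q1 - Xmin = 56 - 32 = 24 


$\frac{3}{2} \cdot IQR = 39.75$ is larger than 16.5 and larger than 24, so the first set 
has no outliers. 


Second Data Set 
• $\frac{3}{2} \cdot IQR = \frac{3}{2} \cdot 11 = 16.5$ 
• Xmax - Q3 = 98 - 89 = 9 
• Q1 - Xmin = 78 - 25.5 = 52.5 


$\frac{3}{2} \cdot IQR = 16.5$ is larger than 9 but smaller than 52.5, so the second 
set has outliers at the low end: both 45 and 25.5 lie below Q1 - 16.5 = 61.5. 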


To find the percentiles, create a frequency, relative frequency, and 
cumulative relative frequency chart. Get the percentiles from that chart. 
First Data Set 


• 30th %ile (between the 6th and 7th values) = $\frac{56 + 59}{2} = 57.5$ 
• 80th %ile (between the 16th and 17th values) = $\frac{84 + 84.5}{2} = 84.25$ 


Second Data Set 


• 30th %ile (7th value) = 78 
• 80th %ile (18th value) = 90 


30% of the data falls below the 30th %ile, and 20% falls above the 80th 
%ile. 
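
These percentiles can also be obtained with R's quantile() function. In the sketch below, type = 2 is used because it averages the two neighboring values when the position is a whole number and otherwise takes the next value up, which appears to match the hand method in this solution. That correspondence is an observation about this example, not a statement from the original text, and the variable names are illustrative.

```r
day <- c(99, 56, 78, 55.5, 32, 90, 80, 81, 56, 59, 45, 77, 84.5, 84,
         70, 72, 68, 32, 79, 90)
evening <- c(98, 78, 68, 83, 81, 89, 88, 76, 65, 45, 98, 90, 80, 84.5,
             85, 79, 78, 98, 90, 79, 81, 25.5)

# 30th and 80th percentiles, expressed as probabilities
probs <- c(0.30, 0.80)

day.pct <- quantile(day, probs = probs, type = 2)
evening.pct <- quantile(evening, probs = probs, type = 2)
day.pct
evening.pct
```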


Example: 

Finding Quartiles and Percentiles Using a Table 

Fifty statistics students were asked how much sleep they get per school night 
(rounded to the nearest hour). The results were (student data): 


AMOUNT OF SLEEP                                 CUMULATIVE 
PER SCHOOL NIGHT                 RELATIVE       RELATIVE 
(HOURS)            FREQUENCY     FREQUENCY      FREQUENCY 

4                  2             0.04           0.04 
5                  5             0.10           0.14 
6                  7             0.14           0.28 
7                  12            0.24           0.52 
8                  14            0.28           0.80 
9                  7             0.14           0.94 
10                 3             0.06           1.00 


Find the 28th percentile: Notice the 0.28 in the "cumulative relative frequency" 
column. 28% of 50 data values = 14. There are 14 values less than the 28th %ile. 
They include the two 4s, the five 5s, and the seven 6s. The 28th %ile is between 
the last 6 and the first 7. The 28th %ile is 6.5. 

Find the median: Look again at the "cumulative relative frequency " column and 
find 0.52. The median is the 50th %ile or the second quartile. 50% of 50 = 25. 
There are 25 values less than the median. They include the two 4s, the five 5s, the 
seven 6s, and eleven of the 7s. The median or 50th %ile is between the 25th (7) 
and 26th (7) values. The median is 7. 

Find the third quartile: The third quartile is the same as the 75th percentile. You 
can "eyeball" this answer. If you look at the "cumulative relative frequency" 
column, you find 0.52 and 0.80. When you have all the 4s, 5s, 6s and 7s, you 
have 52% of the data. When you include all the 8s, you have 80% of the data. 
The 75th %ile, then, must be an 8. Another way to look at the problem is to 
find 75% of 50 (= 37.5) and round up to 38. The third quartile, Q3, is the 38th 
value, which is an 8. You can check this answer by counting the values. (There are 
37 values below the third quartile and 12 values above.) 
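
The table and the third quartile can be rebuilt in R from the frequencies. This is a sketch with illustrative names; the raw data are reconstructed with rep() from the frequency column.

```r
# Hours of sleep for the 50 students, rebuilt from the frequency column
sleep <- rep(4:10, times = c(2, 5, 7, 12, 14, 7, 3))

freq         <- table(sleep)        # frequency
rel.freq     <- prop.table(freq)    # relative frequency
cum.rel.freq <- cumsum(rel.freq)    # cumulative relative frequency

sleep.table <- data.frame(
  hours = as.integer(names(freq)),
  frequency = as.vector(freq),
  relative.frequency = as.vector(rel.freq),
  cumulative.relative.frequency = as.vector(cum.rel.freq)
)
sleep.table

# Third quartile: 75% of 50 is 37.5, rounded up to the 38th ordered value
q3 <- sort(sleep)[ceiling(0.75 * length(sleep))]
q3
```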


Example: 
Exercise: 


Problem: Using the table: 


1. Find the 80th percentile. 
2. Find the 90th percentile. 


3. Find the first quartile. What is another name for the first quartile? 
4. Construct a box plot of the data. 


Solution: 
1. $\frac{8 + 9}{2} = 8.5$ 
2. 9 
3. 6 
4. First Quartile = 25th %ile 


Collaborative Classroom Exercise: Your instructor or a member of the class will 
ask everyone in class how many shirts they own. Answer the following questions. 


1. How many students were surveyed? 

2. What kind of sampling did you do? 

3. Find the mean and standard deviation. 

4. Find the mode. 

5. Construct 2 different histograms. For each, starting value = _______ ending 
value = _______. 

6. Find the median, first quartile, and third quartile. 

7. Construct a box plot. 

8. Construct a table of the data to find the following: 


◦ The 10th percentile 
◦ The 70th percentile 
◦ The percent of students who own fewer than 4 shirts 


Interpreting Percentiles, Quartiles, and Median 

A percentile indicates the relative standing of a data value when data are sorted 
into numerical order, from smallest to largest. p% of data values are less than or 
equal to the pth percentile. For example, 15% of data values are less than or equal 
to the 15th percentile. 


• Low percentiles always correspond to lower data values. 
• High percentiles always correspond to higher data values. 


A percentile may or may not correspond to a value judgment about whether it is 
"good" or "bad". The interpretation of whether a certain percentile is good or bad 
depends on the context of the situation to which the data applies. In some 
situations, a low percentile would be considered "good"; in other contexts a high 
percentile might be considered "good". In many situations, there is no value 
judgment that applies. 


Understanding how to properly interpret percentiles is important not only when 
describing data, but is also important in later chapters of this textbook when 
calculating probabilities. 


Guideline: 


When writing the interpretation of a percentile in the context of the given data, the 
sentence should contain the following information: 


• information about the context of the situation being considered, 

• the data value (value of the variable) that represents the percentile, 

• the percent of individuals or items with data values below the percentile. 

• Additionally, you may also choose to state the percent of individuals or items 
with data values above the percentile. 


Example: 
On a timed math test, the first quartile for times for finishing the exam was 35 
minutes. Interpret the first quartile in the context of this situation. 


• 25% of students finished the exam in 35 minutes or less. 

• 75% of students finished the exam in 35 minutes or more. 

• A low percentile could be considered good, as finishing more quickly on a 
timed exam is desirable. (If you take too long, you might not be able to 
finish.) 


Example: 
On a 20 question math test, the 70th percentile for number of correct answers was 
16. Interpret the 70th percentile in the context of this situation. 


• 70% of students answered 16 or fewer questions correctly. 

• 30% of students answered 16 or more questions correctly. 

• Note: A high percentile could be considered good, as answering more 
questions correctly is desirable. 


Example: 

At a certain community college, it was found that the 30th percentile of credit 
units that students are enrolled for is 7 units. Interpret the 30th percentile in the 
context of this situation. 


• 30% of students are enrolled in 7 or fewer credit units. 

• 70% of students are enrolled in 7 or more credit units. 

• In this example, there is no "good" or "bad" value judgment associated with 
a higher or lower percentile. Students attend community college for varied 
reasons and needs, and their course load varies according to their needs. 


Do the following Practice Problems for Interpreting Percentiles 
Exercise: 


Problem: 


• a. For runners in a race, a low time means a faster run. The winners in a 
race have the shortest running times. Is it more desirable to have a finish 
time with a high or a low percentile when running a race? 

• b. The 20th percentile of run times in a particular race is 5.2 minutes. 
Write a sentence interpreting the 20th percentile in the context of the 
situation. 

• c. A bicyclist in the 90th percentile of a bicycle race between two towns 
completed the race in 1 hour and 12 minutes. Is he among the fastest or 
slowest cyclists in the race? Write a sentence interpreting the 90th 
percentile in the context of the situation. 


Solution: 


• a. For runners in a race it is more desirable to have a low percentile for 
finish time. A low percentile means a short time, which is faster. 

• b. INTERPRETATION: 20% of runners finished the race in 5.2 minutes 
or less. 80% of runners finished the race in 5.2 minutes or longer. 

• c. He is among the slowest cyclists (90% of cyclists were faster than 
him). INTERPRETATION: 90% of cyclists had a finish time of 1 hour, 
12 minutes or less. Only 10% of cyclists had a finish time of 1 hour, 12 
minutes or longer. 


Exercise: 


Problem: 


• a. For runners in a race, a higher speed means a faster run. Is it more 
desirable to have a speed with a high or a low percentile when running a 
race? 

• b. The 40th percentile of speeds in a particular race is 7.5 miles per hour. 
Write a sentence interpreting the 40th percentile in the context of the 
situation. 


Solution: 


• a. For runners in a race it is more desirable to have a high percentile for 
speed. A high percentile means a higher speed, which is faster. 

• b. INTERPRETATION: 40% of runners ran at speeds of 7.5 miles per 
hour or less (slower). 60% of runners ran at speeds of 7.5 miles per hour 
or more (faster). 


Exercise: 
Problem: 


On an exam, would it be more desirable to earn a grade with a high or low 
percentile? Explain. 


Solution: 


On an exam you would prefer a high percentile; higher percentiles 
correspond to higher grades on the exam. 


** With contributions from Roberta Bloom 


Calculating IQR, Quantiles and Percentiles in R 


R has functions for calculating the interquartile range (IQR( )) and different 
percentiles (quantile( )). 


Note: There are several methods for calculating the IQR. The method R uses by 
default will not match the result of the real estate example demonstrated earlier; 
however, by using one of the alternative methods, it is possible to match the 
output. Both are demonstrated below. For more information, type 


?quantile 


at the R prompt and read the section titled "Types". 


# Enter the real estate data 

real.estate = c(389950, 230500, 158000, 479000, 639000, 
114950, 5500000, 387000, 659000, 529000, 575000, 
488800, 1095000) 

# R's default IQR type does not match. We were 

# expecting IQR = 340250 

IQR(real.estate) 

## [1] 252000 

# The R help file mentions that 'type = 6' is the 

# method used by SPSS and Minitab. We'll try that 

# method. It works! 

IQR(real.estate, type = 6) 

## [1] 340250 


# Quantiles. Default is 0, 25, 50, 75, and 100 
# %ile 

quantile(real.estate) 

##      0%     25%     50%     75%    100% 
##  114950  387000  488800  639000 5500000 


# What about by 10s instead? 
quantile(real.estate, probs = seq(0, 1, 0.1)) 


##      0%     10%     20%     30%     40%     50% 
##  114950  172500  293100  388770  461190  488800 
##     60%     70%     80%     90%    100% 
##  538200  600600  651000 1007800 5500000 


As can be seen, R is quite flexible at calculating percentiles. The key is to use the 
quantile() function in conjunction with specifying the probabilities that 
you're interested in. This is most easily done using seq(), the "sequence" 
function in R. Just be sure that your probabilities are all between 0 and 1! 


Glossary 


Interquartile Range (IQR) 
The distance between the third quartile (Q3) and the first quartile (Q1). IQR 
= Q3 - Q1. 


Outlier 
An observation that does not fit the rest of the data. 


Percentile 
A number that divides ordered data into hundredths. 


Example: 
Let a data set contain 200 ordered observations starting with 
{2.3, 2.7, 2.8, 2.9, 2.9, 3.0, ...}. Then the first percentile is 
$\frac{2.7 + 2.8}{2} = 2.75$, because 1% of the data is to the left of this point 
on the number line and 99% of the data is on its right. The second percentile is 
$\frac{2.9 + 2.9}{2} = 2.9$. Percentiles may or may not be part of the data. In 
this example, the first percentile is not in the data, but the second percentile 
is. The median of the data is the second quartile and the 50th percentile. The 
first and third quartiles are the 25th and the 75th percentiles, respectively. 


Quartiles 
The numbers that separate the data into quarters. Quartiles may or may not be 
part of the data. The second quartile is the median of the data. 


Measuring the Center of the Data 
This chapter discusses measuring descriptive statistical information using the 
center of the data 


Mean and Median 


The "center" of a data set is also a way of describing location. The two most 
widely used measures of the "center" of the data are the mean (average) and the 
median. To calculate the mean weight of 50 people, add the 50 weights 
together and divide by 50. To find the median weight of the 50 people, order 
the data and find the number that splits the data into two equal parts (previously 
discussed under box plots in this chapter). The median is generally a better 
measure of the center when there are extreme values or outliers because it is not 
affected by the precise numerical values of the outliers. The mean is the most 
common measure of the center. 


Note: The words "mean" and "average" are often used interchangeably. The 
substitution of one word for the other is common practice. The technical term 
is "arithmetic mean" and "average" is technically a center location. However, 
in practice among non-statisticians, "average" is commonly accepted for 
"arithmetic mean." 


The mean can also be calculated by multiplying each distinct value by its 
frequency, adding the products, and then dividing the sum by the total number 
of data values. The letter used to represent the sample mean is an x with a bar 
over it (pronounced "x bar"): $\bar{x}$. 


The Greek letter $\mu$ (pronounced "mew") represents the population mean. One 
of the requirements for the sample mean to be a good estimate of the population 
mean is for the sample taken to be truly random. 


To see that both ways of calculating the mean are the same, consider the 
sample: 


1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4 


Equation: 


$$\bar{x} = \frac{1+1+1+2+2+3+4+4+4+4+4}{11} \approx 2.7$$

Equation: 


$$\bar{x} = \frac{3 \times 1 + 2 \times 2 + 1 \times 3 + 5 \times 4}{11} \approx 2.7$$


In the second calculation, the frequencies are 3, 2, 1, and 5. 
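
A short R sketch confirms that the two calculations agree. The variable names are illustrative.

```r
x <- c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4)

# First way: add every value and divide by the number of values
mean.direct <- sum(x) / length(x)

# Second way: multiply each distinct value by its frequency,
# add the products, and divide by the total number of values
freq <- table(x)
values <- as.numeric(names(freq))
mean.by.freq <- sum(values * as.vector(freq)) / sum(freq)

round(c(direct = mean.direct, by.frequency = mean.by.freq), 1)
```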


You can quickly find the location of the median by using the expression 
$\frac{n+1}{2}$. The letter n is the total number of data values in the sample. 
If n is an odd number, the median is the middle value of the ordered data 
(ordered smallest to largest). If n is an even number, the median is equal to 
the two middle values added together and divided by 2 after the data has been 
ordered. For example, if the total number of data values is 97, then 
$\frac{n+1}{2} = \frac{97+1}{2} = 49$. The median is the 49th value in the 
ordered data. If the total number of data values is 100, then 
$\frac{n+1}{2} = \frac{100+1}{2} = 50.5$. The median occurs midway between the 
50th and 51st values. The location of the median and the value of the median are 
not the same. The upper case letter M is often used to represent the median. The 
next example illustrates the location of the median and the value of the median. 


Example: 
Exercise: 


Problem: 


AIDS data indicating the number of months an AIDS patient lives after 
taking a new antibody drug are as follows (smallest to largest): 


3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18, 21, 22, 22, 24, 24, 
25, 26, 26, 27, 27, 29, 29, 31, 32, 33, 33, 34, 34, 35, 37, 40, 44, 44, 47 


Calculate the mean and the median. 


Solution: 


The calculation for the mean is: 


$$\bar{x} = \frac{3+4+(8)(2)+10+11+12+13+14+(15)(2)+(16)(2)+\ldots+35+37+40+(44)(2)+47}{40} \approx 23.6$$


To find the median, M, first use the formula for the location. The location 
is: 

$$\frac{n+1}{2} = \frac{40+1}{2} = 20.5$$


Starting at the smallest value, the median is located between the 20th and 
21st values (the two 24s): 


3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18, 21, 22, 22, 24, 24, 
25, 26, 26, 27, 27, 29, 29, 31, 32, 33, 33, 34, 34, 35, 37, 40, 44, 44, 47 


$$M = \frac{24 + 24}{2} = 24$$


The median is 24. 
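
The difference between the location of the median and its value can be sketched in R as follows. The variable names are illustrative; taking the floor and ceiling of the location handles both odd and even n.

```r
aids <- c(3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16,
          16, 17, 17, 18, 21, 22, 22, 24, 24, 25, 26, 26,
          27, 27, 29, 29, 31, 32, 33, 33, 34, 34, 35, 37,
          40, 44, 44, 47)

n <- length(aids)
location <- (n + 1) / 2          # position of the median in the ordered data
location

# Average the values on either side of the location; when the location
# is a whole number, both indices point at the same value
ordered <- sort(aids)
M <- mean(ordered[c(floor(location), ceiling(location))])
M
```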


Calculating Mean and Median in R 


As you might expect, R has built-in functions for calculating the mean (mean()) 
and the median (median()). Additionally, it provides a summary() function 
that reports some basic descriptive statistics of the data. 


# Enter the AIDS data 

aids = c(3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 
16, 17, 17, 18, 21, 22, 22, 24, 24, 25, 26, 26, 
27, 27, 29, 29, 31, 32, 33, 33, 34, 34, 35, 37, 
40, 44, 44, 47) 


# mean and median 
mean(aids) 


## [1] 23.57 


median(aids) 
## [1] 24 
# summary statistics 
summary(aids) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0    15.0    24.0    23.6    32.2    47.0 
Example: 
Exercise: 
Problem: 


Suppose that, in a small town of 50 people, one person earns $5,000,000 
per year and the other 49 each earn $30,000. Which is the better measure 
of the "center," the mean or the median? 


Solution: 
$$\bar{x} = \frac{5000000 + 49 \times 30000}{50} = 129400$$
M = 30000 


(There are 49 people who earn $30,000 and one person who earns 
$5,000,000.) 


The median is a better measure of the "center" than the mean because 49 
of the values are 30,000 and one is 5,000,000. The 5,000,000 is an outlier. 
The 30,000 gives us a better sense of the middle of the data. 
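
The effect of the single large income is easy to see in R. This is a minimal sketch; the variable name incomes is illustrative.

```r
# One person earns $5,000,000; the other 49 each earn $30,000
incomes <- c(5000000, rep(30000, 49))

town.mean <- mean(incomes)      # pulled upward by the outlier
town.median <- median(incomes)  # unaffected by the outlier's size

c(mean = town.mean, median = town.median)
```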


Mode 


Another measure of the center is the mode. The mode is the most frequent 
value. If two values in a data set are tied for the highest frequency, then 
the set is bimodal. 


Example: 

Statistics exam scores for 20 students are as follows: 

50, 53, 59, 59, 63, 63, 72, 72, 72, 72, 72, 76, 78, 81, 83, 84, 84, 84, 90, 93 
Exercise: 


Problem: Find the mode. 
Solution: 


The most frequent score is 72, which occurs five times. Mode = 72. 


Calculating the Mode in R 


R does not have an in-built function for calculating the mode, in part because a 
set of data is likely to have more than one mode. However, it is very easy to 
tabulate your data and see which value or values occur most frequently. 


# Enter the exam scores data 
exam.scores = c(50, 53, 59, 59, 63, 63, 72, 72, 72, 
72, 72, 76, 78, 81, 83, 84, 84, 84, 90, 93) 


# Use 'table' to find the mode 
table(exam.scores ) 

## exam.scores 

## 50 53 59 63 72 76 78 81 83 84 90 93 
##  1  1  2  2  5  1  1  1  1  3  1  1 


# That output is difficult to scan. Sort the 
# output to make it easier to identify the most 
# frequently occurring value. 


sort(table(exam.scores))

## exam.scores
## 50 53 76 78 81 83 90 93 59 63 84 72 
##  1  1  1  1  1  1  1  1  2  2  3  5 
# 72 occurs 5 times, making it the mode. 


Example: 

Five real estate exam scores are 430, 430, 480, 480, 495. The data set is 
bimodal because the scores 430 and 480 each occur twice. 
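
Because a data set can have more than one mode, a small helper that returns every value tied for the highest frequency can be convenient. The function name find.modes is hypothetical (it is not part of base R), and the sketch assumes numeric data.

```r
# Hypothetical helper (not a base R function): return every value that
# ties for the highest frequency. Assumes x is numeric.
find.modes = function(x) {
  counts = table(x)
  as.numeric(names(counts)[as.vector(counts) == max(counts)])
}

# The five real estate exam scores from the example
real.estate = c(430, 430, 480, 480, 495)
real.estate.modes = find.modes(real.estate)
print(real.estate.modes)
```
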

When is the mode the best measure of the "center"? Consider a weight loss 
program that advertises a mean weight loss of six pounds the first week of the 
program. The mode might indicate that most people lose two pounds the first 
week, making the program less appealing. 


Note: The mode can be calculated for qualitative data as well as for 
quantitative data. 


Statistical software will easily calculate the mean, the median, and the mode. 
Some graphing calculators can also make these calculations. In the real world, 
people make these calculations using software. 


The Law of Large Numbers and the Mean 


The Law of Large Numbers says that if you take samples of larger and larger
size from any population, then the mean $\bar{x}$ of the sample is very likely to get
closer and closer to $\mu$. This is discussed in more detail in The Central Limit
Theorem.


Note: The formula for the mean is located in the Summary of Formulas section.


Sampling Distributions and Statistic of a Sampling Distribution 


You can think of a sampling distribution as a relative frequency distribution 
with a great many samples. (See Sampling and Data for a review of relative 
frequency). Suppose thirty randomly selected students were asked the number 
of movies they watched the previous week. The results are in the relative 
frequency table shown below. 


# of movies Relative Frequency 
0 5/30 

1 15/30 

2 6/30 

3 4/30 

4 1/30 


If you let the number of samples get very large (say, 300 million or more), 
the relative frequency table becomes a relative frequency distribution. 


A statistic is a number calculated from a sample. Examples of statistics include the
mean, the median, and the mode, as well as others. The sample mean $\bar{x}$ is an
example of a statistic that estimates the population mean $\mu$.


Glossary 


Mean 
A number that measures the central tendency. A common name for mean
is 'average.' The term 'mean' is a shortened form of 'arithmetic mean.' By
definition, the mean for a sample (denoted by $\bar{x}$) is
$\bar{x} = \frac{\text{Sum of all values in the sample}}{\text{Number of values in the sample}}$,
and the mean for a population (denoted by $\mu$) is
$\mu = \frac{\text{Sum of all values in the population}}{\text{Number of values in the population}}$.


Median 
A number that separates ordered data into halves. Half the values are the 
same number or smaller than the median and half the values are the same 
number or larger than the median. The median may or may not be part of 
the data. 


Mode 
The value that appears most frequently in a set of data. 


Skewness and the Mean, Median, and Mode 
Consider the following data set: 
4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10


This data set produces the histogram shown below. Each interval has width 
one and each value is located in the middle of an interval. 


[Histogram of the data set above; horizontal axis from 4 to 10]


The histogram displays a symmetrical distribution of data. A distribution is 
symmetrical if a vertical line can be drawn at some point in the histogram 
such that the shape to the left and the right of the vertical line are mirror 
images of each other. The mean, the median, and the mode are each 7 for 
these data. In a perfectly symmetrical distribution, the mean and the 
median are the same. This example has one mode (unimodal) and the 
mode is the same as the mean and median. In a symmetrical distribution 
that has two modes (bimodal), the two modes would be different from the 
mean and median. 


The histogram for the data:
4, 5, 6, 6, 6, 7, 7, 7, 7, 8


is not symmetrical. The right-hand side seems "chopped off" compared to
the left side. The shape of the distribution is called skewed to the left because it is
pulled out to the left.


[Histogram of the data set above; horizontal axis from 4 to 8]


The mean is 6.3, the median is 6.5, and the mode is 7. Notice that the 
mean is less than the median and they are both less than the mode. The 
mean and the median both reflect the skewing but the mean more so. 


The histogram for the data: 
6, 7, 7, 7, 7, 8, 8, 8, 9, 10


is also not symmetrical. It is skewed to the right.


[Histogram of the data set above; horizontal axis from 6 to 10]


The mean is 7.7, the median is 7.5, and the mode is 7. Of the three statistics, 
the mean is the largest, while the mode is the smallest. Again, the mean 
reflects the skewing the most. 


To summarize, generally if the distribution of data is skewed to the left, the 
mean is less than the median, which is often less than the mode. If the 
distribution of data is skewed to the right, the mode is often less than the 
median, which is less than the mean. 
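
The relationship between skew and the ordering of the mean and median can be checked directly in R. The sketch below re-enters the three data sets from this section; the helper name center and the vector names are assumptions made for illustration.

```r
# The three data sets discussed in this section
symmetric = c(4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10)
skewed.left = c(4, 5, 6, 6, 6, 7, 7, 7, 7, 8)
skewed.right = c(6, 7, 7, 7, 7, 8, 8, 8, 9, 10)

# Hypothetical helper: mean and median side by side
center = function(x) c(mean = mean(x), median = median(x))

centers = rbind(symmetric = center(symmetric),
                skewed.left = center(skewed.left),
                skewed.right = center(skewed.right))
print(centers)
```

In the left-skewed set the mean falls below the median, and in the right-skewed set it falls above, which matches the summary above.
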


Skewness and symmetry become important when we discuss probability 
distributions in later chapters. 


Measuring the Spread of the Data 

Descriptive Statistics: Measuring the Spread of Data explains standard deviation as a measure of variation in data 
and is part of the collection col10555 written by Barbara Illowsky and Susan Dean. Roberta Bloom made 
contributions that helped to clarify the standard deviation and the variance. 


An important characteristic of any set of data is the variation in the data. In some data sets, the data values are 
concentrated closely near the mean; in other data sets, the data values are more widely spread out from the mean. 
The most common measure of variation, or spread, is the standard deviation. 


The standard deviation is a number that measures how far data values are from their mean. 
The standard deviation 


• provides a numerical measure of the overall amount of variation in a data set
• can be used to determine whether a particular data value is close to or far from the mean


The standard deviation provides a measure of the overall variation in a data set 

The standard deviation is always positive or 0. The standard deviation is small when the data are all concentrated 
close to the mean, exhibiting little variation or spread. The standard deviation is larger when the data values are 
more spread out from the mean, exhibiting more variation. 


Suppose that we are studying waiting times at the checkout line for customers at supermarket A and supermarket 
B; the average wait time at both markets is 5 minutes. At market A, the standard deviation for the waiting time is 2 
minutes; at market B the standard deviation for the waiting time is 4 minutes. 


Because market B has a higher standard deviation, we know that there is more variation in the waiting times at 
market B. Overall, wait times at market B are more spread out from the average; wait times at market A are more 
concentrated near the average. 


The standard deviation can be used to determine whether a data value is close to or far from the mean. 
Suppose that Rosa and Binh both shop at Market A. Rosa waits for 7 minutes and Binh waits for 1 minute at the 
checkout counter. At market A, the mean wait time is 5 minutes and the standard deviation is 2 minutes. The 
standard deviation can be used to determine whether a data value is close to or far from the mean. 


Rosa waits for 7 minutes: 


• 7 is 2 minutes longer than the average of 5; 2 minutes is equal to one standard deviation.
• Rosa's wait time of 7 minutes is 2 minutes longer than the average of 5 minutes.
• Rosa's wait time of 7 minutes is one standard deviation above the average of 5 minutes.


Binh waits for 1 minute. 


• 1 is 4 minutes less than the average of 5; 4 minutes is equal to two standard deviations.

• Binh's wait time of 1 minute is 4 minutes less than the average of 5 minutes.

• Binh's wait time of 1 minute is two standard deviations below the average of 5 minutes.

• A data value that is two standard deviations from the average is just on the borderline for what many
statisticians would consider to be far from the average. Considering data to be far from the mean if it is more
than 2 standard deviations away is more of an approximate "rule of thumb" than a rigid rule. In general, the
shape of the distribution of the data affects how much of the data is farther away than 2 standard deviations.
(We will learn more about this in later chapters.)


• In general, a value = mean + (#ofSTDEVs)(standard deviation)

• where #ofSTDEVs = the number of standard deviations

• 7 is one standard deviation more than the mean of 5 because: 7 = 5 + (1)(2)
• 1 is two standard deviations less than the mean of 5 because: 1 = 5 + (-2)(2)


The equation value = mean + (#ofSTDEVs)(standard deviation) can be expressed for a sample and for a
population:


• Sample: $x = \bar{x} + (\#\text{ofSTDEVs})(s)$
• Population: $x = \mu + (\#\text{ofSTDEVs})(\sigma)$


The lower case letter $s$ represents the sample standard deviation and the Greek letter $\sigma$ (sigma, lower case)
represents the population standard deviation.


The symbol $\bar{x}$ is the sample mean and the Greek symbol $\mu$ is the population mean.


Calculating the Standard Deviation 

If x is a number, then the difference "x - mean" is called its deviation. In a data set, there are as many deviations
as there are items in the data set. The deviations are used to calculate the standard deviation. If the numbers belong
to a population, in symbols a deviation is $x - \mu$. For sample data, in symbols a deviation is $x - \bar{x}$.


The procedure to calculate the standard deviation depends on whether the numbers are the entire population or are
data from a sample. The calculations are similar, but not identical. Therefore the symbol used to represent the
standard deviation depends on whether it is calculated from a population or a sample. The lower case letter $s$
represents the sample standard deviation and the Greek letter $\sigma$ (sigma, lower case) represents the population
standard deviation. If the sample has the same characteristics as the population, then $s$ should be a good estimate
of $\sigma$.


To calculate the standard deviation, we need to calculate the variance first. The variance is an average of the
squares of the deviations (the $x - \bar{x}$ values for a sample, or the $x - \mu$ values for a population). The symbol $\sigma^2$
represents the population variance; the population standard deviation $\sigma$ is the square root of the population
variance. The symbol $s^2$ represents the sample variance; the sample standard deviation $s$ is the square root of the
sample variance. You can think of the standard deviation as a special average of the deviations.


If the numbers come from a census of the entire population and not a sample, when we calculate the average of
the squared deviations to find the variance, we divide by N, the number of items in the population. If the data are
from a sample rather than a population, when we calculate the average of the squared deviations, we divide by
n-1, one less than the number of items in the sample. You can see that in the formulas below.


Formulas for the Sample Standard Deviation 


$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ or $s = \sqrt{\frac{\sum f \cdot (x - \bar{x})^2}{n - 1}}$


• For the sample standard deviation, the denominator is n-1, that is, the sample size MINUS 1.


Formulas for the Population Standard Deviation 


$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$ or $\sigma = \sqrt{\frac{\sum f \cdot (x - \mu)^2}{N}}$
• For the population standard deviation, the denominator is N, the number of items in the population.


In these formulas, f represents the frequency with which a value appears. For example, if a value appears once, f 
is 1. If a value appears three times in the data set or population, f is 3. 
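
The frequency form of the formulas can be sketched in R as follows. The function name freq.sd, its population argument, and the small data set are all assumptions made for illustration; they are not from the original text.

```r
# Hypothetical helper: standard deviation from distinct values x and
# their frequencies f. Set population = TRUE to divide by N instead of n - 1.
freq.sd = function(x, f, population = FALSE) {
  n = sum(f)                      # number of data values
  center = sum(f * x) / n         # mean of the data
  ss = sum(f * (x - center)^2)    # sum of f times the squared deviations
  sqrt(ss / (if (population) n else n - 1))
}

# Made-up data: the value 1 once, the value 2 twice, the value 3 once
x = c(1, 2, 3)
f = c(1, 2, 1)

s = freq.sd(x, f)                          # treats the data as a sample
sigma = freq.sd(x, f, population = TRUE)   # treats the data as a population
print(c(s = s, sigma = sigma))
```

The only difference between the two results is the denominator, so s is always at least as large as the population version computed from the same numbers.
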


Sampling Variability of a Statistic 

The statistic of a sampling distribution was discussed in Descriptive Statistics: Measuring the Center of the 
Data. How much the statistic varies from one sample to another is known as the sampling variability of a 
statistic. You typically measure the sampling variability of a statistic by its standard error. The standard error of 
the mean is an example of a standard error. It is a special standard deviation and is known as the standard 
deviation of the sampling distribution of the mean. You will cover the standard error of the mean in The Central
Limit Theorem (not now). The notation for the standard error of the mean is $\frac{\sigma}{\sqrt{n}}$, where $\sigma$ is the standard
deviation of the population and n is the size of the sample.


Example: 

In a fifth grade class, the teacher was interested in the average age and the sample standard deviation of the ages
of her students. The following data are the ages for a SAMPLE of n = 20 fifth grade students. The ages are
rounded to the nearest half year:

9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5, 11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5

Equation:


$\bar{x} = \frac{9 + 9.5 \times 2 + 10 \times 4 + 10.5 \times 4 + 11 \times 6 + 11.5 \times 3}{20} = 10.525$


The average age is 10.53 years, rounded to 2 places. 
The variance may be calculated by using a table. Then the standard deviation is calculated by taking the square 
root of the variance. We will explain the parts of the table after calculating s. 


Data   Freq.   Deviations                 Deviations²               (Freq.)(Deviations²)
$x$    $f$     $(x - \bar{x})$            $(x - \bar{x})^2$         $(f)(x - \bar{x})^2$
9      1       9 - 10.525 = -1.525        (-1.525)² = 2.325625      1 × 2.325625 = 2.325625
9.5    2       9.5 - 10.525 = -1.025      (-1.025)² = 1.050625      2 × 1.050625 = 2.101250
10     4       10 - 10.525 = -0.525       (-0.525)² = 0.275625      4 × 0.275625 = 1.1025
10.5   4       10.5 - 10.525 = -0.025     (-0.025)² = 0.000625      4 × 0.000625 = 0.0025
11     6       11 - 10.525 = 0.475        (0.475)² = 0.225625       6 × 0.225625 = 1.35375
11.5   3       11.5 - 10.525 = 0.975      (0.975)² = 0.950625       3 × 0.950625 = 2.851875


The sample variance, $s^2$, is equal to the sum of the last column (9.7375) divided by the total number of data
values minus one (20 - 1):


$s^2 = \frac{9.7375}{20 - 1} = 0.5125$


The sample standard deviation s is equal to the square root of the sample variance:
$s = \sqrt{0.5125} = 0.715891$. Rounded to two decimal places, s = 0.72.


• For the following problems, recall that value = mean + (#ofSTDEVs)(standard deviation)
• For a sample: $x = \bar{x} + (\#\text{ofSTDEVs})(s)$

• For a population: $x = \mu + (\#\text{ofSTDEVs})(\sigma)$

• For this example, use $x = \bar{x} + (\#\text{ofSTDEVs})(s)$ because the data is from a sample


Exercise: 


Problem: Find the value that is 1 standard deviation above the mean. Find $(\bar{x} + 1s)$.


Solution:


$(\bar{x} + 1s) = 10.53 + (1)(0.72) = 11.25$


Exercise: 


Problem: Find the value that is two standard deviations below the mean. Find $(\bar{x} - 2s)$.


Solution:


$(\bar{x} - 2s) = 10.53 - (2)(0.72) = 9.09$


Exercise: 


Problem: Find the values that are 1.5 standard deviations from (below and above) the mean.
Solution:


• $(\bar{x} - 1.5s) = 10.53 - (1.5)(0.72) = 9.45$
• $(\bar{x} + 1.5s) = 10.53 + (1.5)(0.72) = 11.61$
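
The three exercises apply the same formula, value = mean + (#ofSTDEVs)(standard deviation), so they can be computed in one step. This sketch uses the rounded summary values from the example; the variable names xbar, s, k, and values are assumptions made for illustration.

```r
# Rounded summary values from the example
xbar = 10.53   # sample mean
s = 0.72       # sample standard deviation

# Number of standard deviations asked for in each exercise
k = c(1, -2, -1.5, 1.5)

# value = mean + (#ofSTDEVs)(standard deviation)
values = xbar + k * s
print(values)
```
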


Explanation of the standard deviation calculation shown in the table 

The deviations show how spread out the data are about the mean. The data value 11.5 is farther from the mean than
is the data value 11. The deviations 0.975 and 0.475 indicate that. A positive deviation occurs when the data value is
greater than the mean. A negative deviation occurs when the data value is less than the mean; the deviation is 
-1.525 for the data value 9. If you add the deviations, the sum is always zero. (For this example, there are n=20 
deviations.) So you cannot simply add the deviations to get the spread of the data. By squaring the deviations, you 
make them positive numbers, and the sum will also be positive. The variance, then, is the average squared 
deviation. 


The variance is a squared measure and does not have the same units as the data. Taking the square root solves the 
problem. The standard deviation measures the spread in the same units as the data. 


Notice that instead of dividing by n=20, the calculation divided by n-1=20-1=19 because the data is a sample. For 
the sample variance, we divide by the sample size minus one (n-1). Why not divide by n? The answer has to do 
with the population variance. The sample variance is an estimate of the population variance. Based on the 
theoretical mathematics that lies behind these calculations, dividing by (n-1) gives a better estimate of the 
population variance. 


The standard deviation, $s$ or $\sigma$, is either zero or larger than zero. When the standard deviation is 0, there is no
spread; that is, all the data values are equal to each other. The standard deviation is small when the data are all
concentrated close to the mean, and is larger when the data values show more variation from the mean. When the
standard deviation is a lot larger than zero, the data values are very spread out about the mean; outliers can make $s$
or $\sigma$ very large.


The standard deviation, when first presented, can seem unclear. By graphing your data, you can get a better "feel" 
for the deviations and the standard deviation. You will find that in symmetrical distributions, the standard deviation 
can be very helpful but in skewed distributions, the standard deviation may not be much help. The reason is that 
the two sides of a skewed distribution have different spreads. In a skewed distribution, it is better to look at the 
first quartile, the median, the third quartile, the smallest value, and the largest value. Because numbers can be 
confusing, always graph your data. 


Example: 
Exercise: 


Problem: Use the following data (first exam scores) from Susan Dean's spring pre-calculus class: 


33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73, 74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 
96, 100 


• a. Create a chart containing the data, frequencies, relative frequencies, and cumulative relative
frequencies to three decimal places.
• b. Calculate the following to one decimal place:

   i. The sample mean
   ii. The sample standard deviation
   iii. The median
   iv. The first quartile
   v. The third quartile
   vi. IQR

• c. Construct a box plot and a histogram on the same set of axes. Make comments about the box plot, the
histogram, and the chart.


Solution: 
• a.

Data   Frequency   Relative Frequency   Cumulative Relative Frequency
33     1           0.032                0.032
42     1           0.032                0.064
49     2           0.065                0.129
53     1           0.032                0.161
55     2           0.065                0.226
61     1           0.032                0.258
63     1           0.032                0.290
67     1           0.032                0.322
68     2           0.065                0.387
69     2           0.065                0.452
72     1           0.032                0.484
73     1           0.032                0.516
74     1           0.032                0.548
78     1           0.032                0.580
80     1           0.032                0.612
83     1           0.032                0.644
88     3           0.097                0.741
90     1           0.032                0.773
92     1           0.032                0.805
94     4           0.129                0.934
96     1           0.032                0.966
100    1           0.032                0.998 (Why isn't this value 1?)
• b.

   i. The sample mean = 73.5
   ii. The sample standard deviation = 17.9
   iii. The median = 73
   iv. The first quartile = 61
   v. The third quartile = 90
   vi. IQR = 90 - 61 = 29


• c. The x-axis goes from 32.5 to 100.5; y-axis goes from -2.4 to 15 for the histogram; number of intervals
is 5 for the histogram so the width of an interval is (100.5 - 32.5) divided by 5 which is equal to 13.6.
Endpoints of the intervals: starting point is 32.5, 32.5+13.6 = 46.1, 46.1+13.6 = 59.7, 59.7+13.6 = 73.3,
73.3+13.6 = 86.9, 86.9+13.6 = 100.5 = the ending value; No data values fall on an interval boundary.


The long left whisker in the box plot is reflected in the left side of the histogram. The spread of the exam scores in 
the lower 50% is greater (73 - 33 = 40) than the spread in the upper 50% (100 - 73 = 27). The histogram, box plot, 
and chart all reflect this. There are a substantial number of A and B grades (80s, 90s, and 100). The histogram 
clearly shows this. The box plot shows us that the middle 50% of the exam scores (IQR = 29) are Ds, Cs, and Bs. 
The box plot also shows us that the lower 25% of the exam scores are Ds and Fs. 
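
The answers to part b can be reproduced in R. This is a sketch rather than the textbook's own method: R offers several quartile conventions, and type = 2 is used here only because it happens to match the quartiles given in the solution (the default, type = 7, interpolates and would give slightly different quartiles). The variable names are assumptions made for illustration.

```r
# First exam scores from the problem statement
scores = c(33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
           74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100)

scores.mean = mean(scores)
scores.sd = sd(scores)          # sample standard deviation (divides by n - 1)
scores.median = median(scores)

# Quartiles; type = 2 is an assumption chosen to match the solution
scores.q = quantile(scores, c(0.25, 0.75), type = 2)
scores.iqr = unname(scores.q[2] - scores.q[1])

print(round(c(mean = scores.mean, sd = scores.sd, median = scores.median), 1))
print(scores.q)
print(scores.iqr)
```
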


Comparing Values from Different Data Sets 


The standard deviation is useful when comparing data values that come from different data sets. If the data sets 
have different means and standard deviations, it can be misleading to compare the data values directly. 


• For each data value, calculate how many standard deviations the value is away from its mean.


• Use the formula: value = mean + (#ofSTDEVs)(standard deviation); solve for #ofSTDEVs.

• $\#\text{ofSTDEVs} = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$

• Compare the results of this calculation.


#ofSTDEVs is often called a "z-score"; we can use the symbol z. In symbols, the formulas become: 


Sample: $x = \bar{x} + zs$, $z = \frac{x - \bar{x}}{s}$
Population: $x = \mu + z\sigma$, $z = \frac{x - \mu}{\sigma}$
Example: 
Exercise: 
Problem: 


Two students, John and Ali, from different high schools, wanted to find out who had the highest G.P.A. when 
compared to his school. Which student had the highest G.P.A. when compared to his school? 


Student GPA School Mean GPA School Standard Deviation 
John 2.85 3.0 0.7 
Ali 77 80 10 

Solution: 


For each student, determine how many standard deviations (#ofSTDEVs) his GPA is away from the average, 
for his school. Pay careful attention to signs when comparing and interpreting the answer. 


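$z = \#\text{ofSTDEVs} = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma}$


For John, $z = \#\text{ofSTDEVs} = \frac{2.85 - 3.0}{0.7} = -0.21$


For Ali, $z = \#\text{ofSTDEVs} = \frac{77 - 80}{10} = -0.3$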


John has the better G.P.A. when compared to his school because his G.P.A. is 0.21 standard deviations below 
his school's mean while Ali's G.P.A. is 0.3 standard deviations below his school's mean. 


John's z-score of -0.21 is higher than Ali's z-score of -0.3. For GPA, higher values are better, so we
conclude that John has the better GPA when compared to his school.
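
The same comparison can be carried out in R. The helper name z.score and its argument names are assumptions made for illustration.

```r
# Hypothetical helper: number of standard deviations a value lies from the mean
z.score = function(value, center, spread) (value - center) / spread

john.z = z.score(2.85, 3.0, 0.7)
ali.z = z.score(77, 80, 10)

print(round(c(John = john.z, Ali = ali.z), 2))

# TRUE means John's GPA sits higher relative to his school than Ali's does
print(john.z > ali.z)
```
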


The following lists give a few facts that provide a little more insight into what the standard deviation tells us about 
the distribution of the data. 
For ANY data set, no matter what the distribution of the data is: 


• At least 75% of the data is within 2 standard deviations of the mean.
• At least 89% of the data is within 3 standard deviations of the mean.
• At least 95% of the data is within 4 1/2 standard deviations of the mean.
• This is known as Chebyshev's Rule.


For data having a distribution that is MOUND-SHAPED and SYMMETRIC:


• Approximately 68% of the data is within 1 standard deviation of the mean.
• Approximately 95% of the data is within 2 standard deviations of the mean.
• More than 99% of the data is within 3 standard deviations of the mean.
• This is known as the Empirical Rule.
• It is important to note that this rule only applies when the shape of the distribution of the data is mound-shaped
and symmetric. We will learn more about this when studying the "Normal" or "Gaussian" probability
distribution in later chapters.
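
The three percentages listed under Chebyshev's Rule follow from its standard statement, which this section does not spell out: at least 1 - 1/k² of the data lies within k standard deviations of the mean, for k greater than 1. A minimal sketch, with the hypothetical helper name chebyshev.bound:

```r
# Hypothetical helper: Chebyshev's lower bound on the fraction of data
# within k standard deviations of the mean (meaningful for k > 1)
chebyshev.bound = function(k) 1 - 1 / k^2

k = c(2, 3, 4.5)
bounds = chebyshev.bound(k)
print(round(bounds, 3))
```
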


**With contributions from Roberta Bloom 


Variance and Standard Deviation in R 


As can be expected, there are functions in R to calculate the variance (var()) and standard deviation (sd()). Both use the sample (n-1) denominator.
Here, we will calculate the variance and standard deviation of the "ages" data presented earlier in this chapter. 


# Enter the ages data 
ages = c(9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5,
    10.5, 11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5)


# variance
var(ages)
## [1] 0.5125 


# standard deviation 
sd(ages) 
## [1] 0.7159 


Glossary 


Standard Deviation 
A number that is equal to the square root of the variance and measures how far data values are from their
mean. Notation: $s$ for sample standard deviation and $\sigma$ for population standard deviation.


Variance
Mean of the squared deviations from the mean. Square of the standard deviation. For a set of data, a deviation
can be represented as $x - \bar{x}$, where $x$ is a value of the data and $\bar{x}$ is the sample mean. The sample variance is
equal to the sum of the squares of the deviations divided by the difference of the sample size and 1.


Practice 1: Center of the Data 

This module provides students with opportunities to apply concepts related 
to descriptive statistics. Students are asked to take a set of sample data and 
calculate a series of statistical values for that data. 


Student Learning Outcomes 


• The student will calculate and interpret the center, spread, and location
of the data.
• The student will construct and interpret histograms and box plots.


Given 


Sixty-five randomly selected car salespersons were asked the number of 
cars they generally sell in one week. Fourteen people answered that they 
generally sell three cars; nineteen generally sell four cars; twelve generally 
sell five cars; nine generally sell six cars; eleven generally sell seven cars. 


Complete the Table 


Data Value (# cars)   Frequency   Relative Frequency   Cumulative Relative Frequency


Discussion Questions 


Exercise: 


Problem: What does the frequency column sum to? Why? 


Solution: 
65 


Exercise: 


Problem: What does the relative frequency column sum to? Why? 


Solution: 


1 
Exercise: 


Problem: 


What is the difference between relative frequency and frequency for 
each data value? 


Exercise: 


Problem: 


What is the difference between cumulative relative frequency and 
relative frequency for each data value? 


Enter the Data 


Enter your data into your calculator or computer. 
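
In R, one way to enter data given as values with frequencies is rep(), which repeats each value the stated number of times. The sketch below uses the car sales counts from the "Given" section; the variable names are assumptions made for illustration.

```r
# rep() repeats each data value by its frequency
cars = rep(c(3, 4, 5, 6, 7), times = c(14, 19, 12, 9, 11))

# Frequencies, relative frequencies, and cumulative relative frequencies
cars.freq = table(cars)
cars.rel = cars.freq / length(cars)
cars.cum = cumsum(cars.rel)

print(cbind(Frequency = cars.freq,
            Relative = round(cars.rel, 3),
            Cumulative = round(cars.cum, 3)))
```
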


Construct a Histogram 


Determine appropriate minimum and maximum x and y values and the 
scaling. Sketch the histogram below. Label the horizontal and vertical axes 
with words. Include numerical scaling. 


Data Statistics 


Calculate the following values: 
Exercise: 


Problem: Sample mean = $\bar{x}$ =


Solution: 


4.75 


Exercise: 


Problem: Sample standard deviation = $s$ =


Solution: 
1.39 
Exercise: 


Problem: Sample size = n =


Solution: 


65 


Calculations 


Use the table in section 2.11.3 to calculate the following values: 
Exercise: 


Problem: Median = 


Solution: 
4 
Exercise: 


Problem: Mode = 


Solution: 
4 
Exercise: 


Problem: First quartile = 


Solution: 
4 
Exercise: 


Problem: Second quartile = median = 50th percentile = 


Solution: 


4 


Exercise: 


Problem: Third quartile = 


Solution: 


6 


Exercise: 


Problem: Interquartile range (IQR) = ___ - ___ = ___


Solution:
6 - 4 = 2


Exercise: 


Problem: 10th percentile = 


Solution: 
3 


Exercise: 


Problem: 70th percentile = 
Solution: 
6 
Exercise: 
Problem: Find the value that is 3 standard deviations: 


• a. Above the mean
• b. Below the mean


Solution:


• a. 8.93


• b. 0.58


Box Plot 


Construct a box plot below. Use a ruler to measure and scale accurately. 


Interpretation 


Looking at your box plot, does it appear that the data are concentrated 
together, spread out evenly, or concentrated in some areas, but not in 
others? How can you tell? 


Practice 2: Spread of the Data 
Practice exercise for Descriptive Statistics 


Student Learning Outcomes 


• The student will calculate measures of the center of the data.
• The student will calculate the spread of the data.


Given 


The population parameters below describe the full-time equivalent number of 
students (FTES) each year at Lake Tahoe Community College from 1976-77 
through 2004-2005. (Source: Graphically Speaking by Bill King, LTCC 
Institutional Research, December 2005). 


Use these values to answer the following questions: 


• $\mu$ = 1000 FTES
• Median = 1014 FTES
• $\sigma$ = 474 FTES
• First quartile = 528.5 FTES
• Third quartile = 1447.5 FTES
• n = 29 years


Calculate the Values 


Exercise: 


Problem: 


A sample of 11 years is taken. About how many are expected to have a 
FTES of 1014 or above? Explain how you determined your answer. 


Solution: 


6 


Exercise: 


Problem: 75% of all years have a FTES: 
• a. At or below:
• b. At or above:

Solution:


• a. 1447.5
• b. 528.5


Exercise: 


Problem: The population standard deviation = 


Solution: 


474 FTES 
Exercise: 


Problem: 


What percent of the FTES were from 528.5 to 1447.5? How do you 
know? 


Solution: 
50% 
Exercise: 
Problem: What is the IQR? What does the IQR represent?


Solution:


919


Exercise: 


Problem: 
How many standard deviations away from the mean is the median? 
Solution: 


0.03 


Additional Information: The population FTES for 2005-2006 through 2010- 
2011 was given in an updated report. (Source: 
http://www.ltcc.edu/data/ResourcePDF/LTCC_FactBook_2010-11.pdf). The 
data are reported here. 


Year         2005-06   2006-07   2007-08   2008-09   2009-10   2010-11
Total FTES   1585      1690      1735      1935      2021      1890

Exercise: 
Problem: 


Calculate the mean, median, standard deviation, first quartile, the third 
quartile and the IQR. Round to one decimal place. 


Solution: 


mean = 1809.3 

median = 1812.5 

standard deviation = 151.2 
First quartile = 1690 


Third quartile = 1935 
IQR = 245
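
These values can be reproduced in R, with one caveat: sd() divides by n-1, while the standard deviation of 151.2 in the solution corresponds to dividing by N, treating the six years as a population. The sketch below computes the population version by hand. The variable names are assumptions made for illustration, and type = 2 is chosen only because it matches the quartiles given in the solution.

```r
# Total FTES for 2005-06 through 2010-11
ftes = c(1585, 1690, 1735, 1935, 2021, 1890)

ftes.mean = mean(ftes)
ftes.median = median(ftes)

# Population standard deviation: divide the squared deviations by N
ftes.sigma = sqrt(mean((ftes - ftes.mean)^2))

# Quartiles; type = 2 is an assumption chosen to match the solution
ftes.q = quantile(ftes, c(0.25, 0.75), type = 2)
ftes.iqr = unname(ftes.q[2] - ftes.q[1])

print(round(c(mean = ftes.mean, median = ftes.median, sigma = ftes.sigma), 1))
print(ftes.q)
print(ftes.iqr)
```
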


Exercise: 
Problem: 
Construct a boxplot for the FTES for 2005-2006 through 2010-2011 and 
a boxplot for the FTES for 1976-1977 through 2004-2005. 
Exercise: 
Problem: 
Compare the IQR for the FTES for 1976-77 through 2004-2005 with the
IQR for the FTES for 2005-2006 through 2010-2011. Why do you
suppose the IQRs are so different?


Solution: 


Hint: Think about the number of years covered by each time period and 
what happened to higher education during those periods. 


Probability Topics 
This module introduces the concept of Probability, the chance of an event 
occurring. 


Student Learning Outcomes 
By the end of this chapter, the student should be able to: 


• Understand and use the terminology of probability.
• Determine whether two events are mutually exclusive and whether two
events are independent.
• Calculate probabilities using the Addition Rules and Multiplication
Rules.
• Construct and interpret Contingency Tables.
• Construct and interpret Venn Diagrams (optional).
• Construct and interpret Tree Diagrams (optional).


Introduction 


It is often necessary to "guess" about the outcome of an event in order to 
make a decision. Politicians study polls to guess their likelihood of winning 
an election. Teachers choose a particular course of study based on what they 
think students can comprehend. Doctors choose the treatments needed for 
various diseases based on their assessment of likely results. You may have 
visited a casino where people play games chosen because of the belief that 
the likelihood of winning is good. You may have chosen your course of 
study based on the probable availability of jobs. 


You have, more than likely, used probability. In fact, you probably have an 
intuitive sense of probability. Probability deals with the chance of an event 
occurring. Whenever you weigh the odds of whether or not to do your 
homework or to study for an exam, you are using probability. In this 
chapter, you will learn to solve probability problems using a systematic 
approach. 


Optional Collaborative Classroom Exercise 


Your instructor will survey your class. Count the number of students in the 
class today. 


• Raise your hand if you have any change in your pocket or purse.
Record the number of raised hands.
• Raise your hand if you rode a bus within the past month. Record the
number of raised hands.
• Raise your hand if you answered "yes" to BOTH of the first two
questions. Record the number of raised hands.


Use the class data as estimates of the following probabilities. P(change) 
means the probability that a randomly chosen person in your class has 
change in his/her pocket or purse. P(bus) means the probability that a 
randomly chosen person in your class rode a bus within the last month and 
so on. Discuss your answers. 


• Find P(change).
• Find P(bus).
• Find P(change and bus). Find the probability that a randomly chosen
student in your class has change in his/her pocket or purse and rode a
bus within the last month.
• Find P(change | bus). Find the probability that a randomly chosen
student has change given that he/she rode a bus within the last month.
Count all the students that rode a bus. From the group of students who
rode a bus, count those who have change. The probability is equal to
those who have change and rode a bus divided by those who rode a
bus.
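
The four probabilities can be computed from the recorded counts as sketched below. The counts here are made up purely for illustration (your class will have its own numbers), and the variable names are assumptions.

```r
# Made-up classroom counts, for illustration only
n.students = 30   # students in class today
n.change = 18     # have change
n.bus = 12        # rode a bus within the past month
n.both = 9        # have change AND rode a bus

p.change = n.change / n.students
p.bus = n.bus / n.students
p.change.and.bus = n.both / n.students

# Conditional probability: restrict attention to the bus riders
p.change.given.bus = n.both / n.bus

print(c(p.change, p.bus, p.change.and.bus, p.change.given.bus))
```

Dividing P(change and bus) by P(bus) gives the same value as n.both / n.bus, because the class size cancels.
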


Terminology 

Probability: Terminology is part of the collection col10555 written by
Barbara Illowsky and Susan Dean. It defines key terms related to probability
and has contributions from Roberta Bloom.


Probability is a measure that is associated with how certain we are of 
outcomes of a particular experiment or activity. An experiment is a 
planned operation carried out under controlled conditions. If the result is 
not predetermined, then the experiment is said to be a chance experiment. 
Flipping one fair coin twice is an example of an experiment. 


The result of an experiment is called an outcome. A sample space is a set 
of all possible outcomes. Three ways to represent a sample space are to list 
the possible outcomes, to create a tree diagram, or to create a Venn diagram. 
The uppercase letter S is used to denote the sample space. For example, if 
you flip one fair coin, S = {H, T} where H = heads and T = tails are the 
outcomes. 


An event is any combination of outcomes. Upper case letters like A and B 
represent events. For example, if the experiment is to flip one fair coin, 
event A might be getting at most one head. The probability of an event A is 
written P(A). 


The probability of any outcome is the long-term relative frequency of 
that outcome. Probabilities are between 0 and 1, inclusive (includes 0 and 
1 and all numbers between these values). P(A) = 0 means the event A can 
never happen. P(A) = 1 means the event A always happens. P(A) = 0.5 
means the event A is equally likely to occur or not to occur. For example, if 
you flip one fair coin repeatedly (from 20 to 2,000 to 20,000 times) the 
relative frequency of heads approaches 0.5 (the probability of heads). 


Equally likely means that each outcome of an experiment occurs with 
equal probability. For example, if you toss a fair, six-sided die, each face 
(1, 2, 3, 4, 5, or 6) is as likely to occur as any other face. If you toss a fair 
coin, a Head(H) and a Tail(T) are equally likely to occur. If you randomly 
guess the answer to a true/false question on an exam, you are equally likely 
to select a correct answer or an incorrect answer. 


To calculate the probability of an event A when all outcomes in the 
sample space are equally likely, count the number of outcomes for event 
A and divide by the total number of outcomes in the sample space. For 
example, if you toss a fair dime and a fair nickel, the sample space is 
{HH, TH, HT, TT} where T = tails and H = heads. The sample space 
has four outcomes. A = getting one head. There are two outcomes, 
{HT, TH}, so P(A) = 2/4. 

Suppose you roll one fair six-sided die, with the numbers {1, 2, 3, 4, 5, 6} on 
its faces. Let event E = rolling a number that is at least 5. There are two 
outcomes {5, 6}. P(E) = 2/6. If you were to roll the die only a few times, 
you would not be surprised if your observed results did not match the 
probability. If you were to roll the die a very large number of times, you 
would expect that, overall, 2/6 of the rolls would result in an outcome of "at 
least 5". You would not expect exactly 2/6. The long-term relative 
frequency of obtaining this result would approach the theoretical probability 
of 2/6 as the number of repetitions grows larger and larger. 


This important characteristic of probability experiments is known as the 
Law of Large Numbers: as the number of repetitions of an experiment is 
increased, the relative frequency obtained in the experiment tends to 
become closer and closer to the theoretical probability. Even though the 
outcomes don't happen according to any set pattern or order, overall, the 
long-term observed relative frequency will approach the theoretical 
probability. (The word empirical is often used instead of the word 
observed.) The Law of Large Numbers will be discussed again in Chapter 7. 
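
A short simulation can illustrate the Law of Large Numbers for the event "at least 5" described above. This is only a sketch: the seed and the numbers of rolls are arbitrary choices, and the exact relative frequencies you see may differ between R versions.

```r
set.seed(42)  # arbitrary seed so the simulation can be repeated

n_rolls <- c(10, 100, 1000, 100000)  # increasing numbers of repetitions

# For each number of rolls, simulate a fair die and record the
# relative frequency of the event "at least 5" (a 5 or a 6)
rel_freq <- sapply(n_rolls, function(n) {
  rolls <- sample(1:6, size = n, replace = TRUE)
  mean(rolls >= 5)
})

# Compare the observed relative frequencies with the theoretical 2/6
data.frame(n_rolls, rel_freq, theoretical = 2 / 6)
```

With only 10 rolls the relative frequency may be far from 2/6; with 100,000 rolls it should be very close, but usually not exactly equal.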


It is important to realize that in many situations, the outcomes are not 
equally likely. A coin or die may be unfair, or biased. Two math 
professors in Europe had their statistics students test the Belgian 1 Euro 
coin and discovered that in 250 trials, a head was obtained 56% of the time 
and a tail was obtained 44% of the time. The data seem to show that the 
coin is not a fair coin; more repetitions would be helpful to draw a more 
accurate conclusion about such bias. Some dice may be biased. Look at the 
dice in a game you have at home; the spots on each face are usually small 
holes carved out and then painted to make the spots visible. Your dice may 
or may not be biased; it is possible that the outcomes may be affected by the 
slight weight differences due to the different numbers of holes in the faces. 
Gambling casinos have a lot of money depending on outcomes from rolling 
dice, so casino dice are made differently to eliminate bias. Casino dice have 
flat faces; the holes are completely filled with paint having the same density 
as the material that the dice are made out of so that each face is equally 
likely to occur. Later in this chapter we will learn techniques to use to work 
with probabilities for events that are not equally likely. 


"OR" Event: 

An outcome is in the event A OR B if the outcome is in A or is in B or is 
in both A and B. For example, let A = {1, 2, 3, 4, 5} and 
B = {4, 5, 6, 7, 8}. A OR B = {1, 2, 3, 4, 5, 6, 7, 8}. Notice that 4 
and 5 are NOT listed twice. 


"AND" Event: 

An outcome is in the event A AND B if the outcome is in both A and B at 
the same time. For example, let A and B be {1, 2, 3, 4, 5} and 

{4, 5, 6, 7, 8}, respectively. Then A AND B = {4, 5}. 
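
The "OR" and "AND" events above correspond to set union and intersection. A minimal sketch in R, using the base functions union() and intersect() on the sets A and B from the examples:

```r
A <- c(1, 2, 3, 4, 5)
B <- c(4, 5, 6, 7, 8)

# A OR B: every outcome that is in A, in B, or in both (no duplicates)
A_or_B <- union(A, B)

# A AND B: only the outcomes that are in both A and B
A_and_B <- intersect(A, B)

A_or_B   # 1 2 3 4 5 6 7 8
A_and_B  # 4 5
```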


The complement of event A is denoted A’ (read "A prime"). A’ consists of 
all outcomes that are NOT in A. Notice that P(A) + P(A’) = 1. For 
example, let S = {1, 2, 3, 4, 5, 6} and let A = {1, 2, 3, 4}. Then, 

A’ = {5, 6}. P(A) = 4/6, P(A’) = 2/6, and P(A) + P(A’) = 4/6 + 2/6 = 1. 

The conditional probability of A given B is written P(A|B). P(A|B) is 
the probability that event A will occur given that the event B has already 
occurred. A conditional reduces the sample space. We calculate the 
probability of A from the reduced sample space B. The formula to calculate 
P(A|B) is 


P(A|B) = P(A AND B) / P(B) 


where P(B) is greater than 0. 
For example, suppose we toss one fair, six-sided die. The sample space 
S = {1, 2, 3, 4, 5, 6}. Let A = face is 2 or 3 and B = face is even (2, 4, 6). 
To calculate P(A|B), we count the number of outcomes 2 or 3 in the 
sample space B = {2, 4, 6}. Then we divide that by the number of 
outcomes in B (and not S). Only the outcome 2 qualifies, so P(A|B) = 1/3. 


We get the same result by using the formula. Remember that S has 6 
outcomes. 


P(A|B) = P(A AND B) / P(B) 
= [(the number of outcomes that are 2 or 3 and even in S) / 6] / [(the number of outcomes that are even in S) / 6] 
= (1/6) / (3/6) = 1/3 
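
Both routes to P(A|B) in the die example can be checked in R. This is a small sketch; the variable names are chosen here for illustration.

```r
S <- 1:6          # sample space for one fair six-sided die
A <- c(2, 3)      # face is 2 or 3
B <- c(2, 4, 6)   # face is even

# Route 1: count within the reduced sample space B
p_reduced <- length(intersect(A, B)) / length(B)

# Route 2: the formula P(A|B) = P(A AND B) / P(B), computed in S
p_A_and_B <- length(intersect(A, B)) / length(S)
p_B       <- length(B) / length(S)
p_formula <- p_A_and_B / p_B

c(p_reduced, p_formula)  # both equal 1/3
```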


Understanding Terminology and Symbols 

It is important to read each problem carefully to think about and understand 
what the events are. Understanding the wording is the first very important 
step in solving probability problems. Reread the problem several times if 
necessary. Clearly identify the event of interest. Determine whether there is 
a condition stated in the wording that would indicate that the probability is 
conditional; carefully identify the condition, if any. 

Exercise: 


Problem: 


In a particular college class, there are male and female students. Some 
students have long hair and some students have short hair. Write the 
symbols for the probabilities of the events for parts (a) through (j) 
below. (Note that you can't find numerical answers here. You were not 
given enough information to find any probability values yet; 
concentrate on understanding the symbols.) 


• Let F be the event that a student is female. 
• Let M be the event that a student is male. 
• Let S be the event that a student has short hair. 
• Let L be the event that a student has long hair. 


• a. The probability that a student does not have long hair. 
• b. The probability that a student is male or has short hair. 
• c. The probability that a student is a female and has long hair. 
• d. The probability that a student is male, given that the student has 
long hair. 
• e. The probability that a student has long hair, given that the 
student is male. 
• f. Of all the female students, the probability that a student has 
short hair. 
• g. Of all students with long hair, the probability that a student is 
female. 
• h. The probability that a student is female or has long hair. 
• i. The probability that a randomly selected student is a male 
student with short hair. 
• j. The probability that a student is female. 


Solution: 


• a. P(L') = P(S) 
• b. P(M OR S) 
• c. P(F AND L) 
• d. P(M|L) 
• e. P(L|M) 
• f. P(S|F) 
• g. P(F|L) 
• h. P(F OR L) 
• i. P(M AND S) 
• j. P(F) 


Glossary 


Conditional Probability 
The likelihood that an event will occur given that another event has 
already occurred. 


Equally Likely 
Each outcome of an experiment has the same probability. 


Experiment 
A planned activity carried out under controlled conditions. 


Event 
A subset in the set of all outcomes of an experiment. The set of all 
outcomes of an experiment is called a sample space and denoted 
usually by S. An event is any arbitrary subset in S. It can contain one 
outcome, two outcomes, no outcomes (empty subset), the entire 
sample space, etc. Standard notations for events are capital letters such 
as A, B, C, etc. 


Outcome (observation) 
A particular result of an experiment. 


Probability 
A number between 0 and 1, inclusive, that gives the likelihood that a 
specific event will occur. The foundation of statistics is given by the 
following 3 axioms (by A. N. Kolmogorov, 1930s): Let S denote the 
sample space and let A and B be two events in S. Then: 


• 0 ≤ P(A) ≤ 1; 

• If A and B are any two mutually exclusive events, then 
P(A OR B) = P(A) + P(B); 

• P(S) = 1. 


Sample Space 
The set of all possible outcomes of an experiment. 


Independent and Mutually Exclusive Events 

Probability: Independent and Mutually Exclusive Events is part of the 
collection col10555 written by Barbara Illowsky and Susan Dean and 
explains the concept of independent events, where the probability of event 
A does not have any effect on the probability of event B, and mutually 
exclusive events, where events A and B cannot occur at the same time. The 
module has contributions from Roberta Bloom. 


Independent and mutually exclusive do not mean the same thing. 


Independent Events 


Two events are independent if the following are true: 


• P(A|B) = P(A) 
• P(B|A) = P(B) 
• P(A AND B) = P(A) · P(B) 


Two events A and B are independent if the knowledge that one occurred 
does not affect the chance the other occurs. For example, the outcomes of 
two rolls of a fair die are independent events. The outcome of the first roll 
does not change the probability for the outcome of the second roll. To show 
two events are independent, you must show only one of the above 
conditions. If two events are NOT independent, then we say that they are 
dependent. 


Sampling may be done with replacement or without replacement. 


• With replacement: If each member of a population is replaced after it 
is picked, then that member has the possibility of being chosen more 
than once. When sampling is done with replacement, then events are 
considered to be independent, meaning the result of the first pick will 
not change the probabilities for the second pick. 

• Without replacement: When sampling is done without replacement, 
then each member of a population may be chosen only once. In this 
case, the probabilities for the second pick are affected by the result of 
the first pick. The events are considered to be dependent or not 
independent. 
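
The two kinds of sampling can be tried with R's sample() function, whose replace argument switches between them. The five-member population and the seed below are hypothetical, assumed only for illustration.

```r
population <- c("A", "B", "C", "D", "E")  # hypothetical population of 5 members
set.seed(10)  # arbitrary seed so the draws can be repeated

# With replacement: a member may be picked more than once
with_replacement <- sample(population, size = 5, replace = TRUE)

# Without replacement: each member can be picked at most once
without_replacement <- sample(population, size = 5, replace = FALSE)

# Chance that the second pick is "A", given that the first pick was "A"
p_second_A_with    <- 1 / length(population)        # unchanged: 1/5
p_second_A_without <- 0 / (length(population) - 1)  # "A" is no longer available: 0

with_replacement
without_replacement
```

The last two values show why picks are independent with replacement and dependent without it: only in the second case does the first pick change the probabilities for the second.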


If it is not known whether A and B are independent or dependent, assume 
they are dependent until you can show otherwise. 


Mutually Exclusive Events 


A and B are mutually exclusive events if they cannot occur at the same 
time. This means that A and B do not share any outcomes and 
P(A AND B) = 0. 


For example, suppose the sample space S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. 
Let A = {1, 2, 3, 4, 5}, B = {4, 5, 6, 7, 8}, and C = {7, 9}. 
A AND B = {4, 5}. P(A AND B) = 2/10 and is not equal to zero. 
Therefore, A and B are not mutually exclusive. A and C do not have any 
numbers in common so P(A AND C) = 0. Therefore, A and C are 
mutually exclusive. 


If it is not known whether A and B are mutually exclusive, assume they 
are not until you can show otherwise. 


The following examples illustrate these definitions and terms. 


Example: 

Flip two fair coins. (This is an experiment.) 

The sample space is {HH, HT, TH, TT} where T = tails and H = heads. 
The outcomes are HH, HT, TH, and TT. The outcomes HT and TH are 
different. The HT means that the first coin showed heads and the second 
coin showed tails. The TH means that the first coin showed tails and the 
second coin showed heads. 


• Let A = the event of getting at most one tail. (At most one tail means 
0 or 1 tail.) Then A can be written as {HH, HT, TH}. The outcome 
HH shows 0 tails. HT and TH each show 1 tail. 

• Let B = the event of getting all tails. B can be written as {TT}. B is 
the complement of A. So, B = A’. Also, 
P(A) + P(B) = P(A) + P(A’) = 1. 

• The probabilities for A and for B are P(A) = 3/4 and P(B) = 1/4. 

• Let C = the event of getting all heads. C = {HH}. Since B = {TT}, 
P(B AND C) = 0. B and C are mutually exclusive. (B and C have 
no members in common because you cannot have all tails and all 
heads at the same time.) 

• Let D = event of getting more than one tail. D = {TT}. P(D) = 1/4. 

• Let E = event of getting a head on the first flip. (This implies you can 
get either a head or tail on the second flip.) E = {HT, HH}. 
P(E) = 2/4. 

• Find the probability of getting at least one (1 or 2) tail in two flips. 
Let F = event of getting at least one tail in two flips. 


Example: 

Roll one fair 6-sided die. The sample space is {1, 2, 3, 4, 5, 6}. Let event 
A = a face is odd. Then A = {1, 3, 5}. Let event B = a face is even. Then 
B = {2, 4, 6}. 


• Find the complement of A, A’. The complement of A, A’, is B 
because A and B together make up the sample space. 
P(A) + P(B) = P(A) + P(A’) = 1. Also, P(A) = 3/6 and 
P(B) = 3/6. 

• Let event C = odd faces larger than 2. Then C = {3, 5}. Let event D 
= all even faces smaller than 5. Then D = {2, 4}. P(C AND D) = 0 
because you cannot have an odd and even face at the same time. 
Therefore, C and D are mutually exclusive events. 

• Let event E = all faces less than 5. E = {1, 2, 3, 4}. 

Exercise: 


Problem: 


Are C and E mutually exclusive events? (Answer yes or no.) 
Why or why not? 


Solution: 


No. C = {3, 5} and E = {1, 2, 3, 4}. P(C AND E) = 1/6. To be 
mutually exclusive, P(C AND E) must be 0. 


• Find P(C|A). This is a conditional. Recall that the event C is {3, 5} 
and event A is {1, 3, 5}. To find P(C|A), find the probability of C 
using the sample space A. You have reduced the sample space from 
the original sample space {1, 2, 3, 4, 5, 6} to {1, 3, 5}. So, 
P(C|A) = 2/3. 


Example: 

Let event G = taking a math class. Let event H = taking a science class. 
Then, G AND H = taking a math class and a science class. Suppose 
P(G) = 0.6, P(H) = 0.5, and P(G AND H) = 0.3. Are G and H 
independent? 

If G and H are independent, then you must show ONE of the following: 


• P(G|H) = P(G) 
• P(H|G) = P(H) 
• P(G AND H) = P(G) · P(H) 


Note: The choice you make depends on the information you have. You 
could choose any of the methods here because you have the necessary 
information. 


Exercise: 


Problem: Show that P(G|H) = P(G). 


Solution: 


P(G|H) = P(G AND H) / P(H) = 0.3 / 0.5 = 0.6 = P(G) 


Exercise: 
Problem: Show P(G AND H) = P(G) · P(H). 


Solution: 


P(G) · P(H) = 0.6 · 0.5 = 0.3 = P(G AND H) 


Since G and H are independent, then, knowing that a person is taking a 
science class does not change the chance that he/she is taking math. If the 
two events had not been independent (that is, they are dependent) then 
knowing that a person is taking a science class would change the chance 
he/she is taking math. For practice, show that P(H|G) = P(H) to show 
that G and H are independent events. 
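
The independence checks for G and H can be verified numerically. The sketch below uses all.equal() rather than == because decimal fractions such as 0.3 / 0.5 are not always stored exactly in floating-point arithmetic.

```r
p_G       <- 0.6  # P(G): taking a math class
p_H       <- 0.5  # P(H): taking a science class
p_G_and_H <- 0.3  # P(G AND H)

p_G_given_H <- p_G_and_H / p_H  # P(G|H)
p_H_given_G <- p_G_and_H / p_G  # P(H|G)

# Any ONE of these being TRUE shows independence; here all three are TRUE
check_1 <- isTRUE(all.equal(p_G_given_H, p_G))
check_2 <- isTRUE(all.equal(p_H_given_G, p_H))
check_3 <- isTRUE(all.equal(p_G_and_H, p_G * p_H))

c(check_1, check_2, check_3)
```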


Example: 

In a box there are 3 red cards and 5 blue cards. The red cards are marked 
with the numbers 1, 2, and 3, and the blue cards are marked with the 
numbers 1, 2, 3, 4, and 5. The cards are well-shuffled. You reach into the 
box (you cannot see into it) and draw one card. 

Let R = red card is drawn, B = blue card is drawn, E = even-numbered 
card is drawn. 

The sample space S = {R1, R2, R3, B1, B2, B3, B4, B5}. S has 8 
outcomes. 


• P(R) = 3/8. P(B) = 5/8. P(R AND B) = 0. (You cannot draw one 
card that is both red and blue.) 

• P(E) = 3/8. (There are 3 even-numbered cards, R2, B2, and B4.) 

• P(E|B) = 2/5. (There are 5 blue cards: B1, B2, B3, B4, and B5. Out 
of the blue cards, there are 2 even cards: B2 and B4.) 

• P(B|E) = 2/3. (There are 3 even-numbered cards: R2, B2, and B4. 
Out of the even-numbered cards, 2 are blue: B2 and B4.) 

• The events R and B are mutually exclusive because 
P(R AND B) = 0. 

• Let G = card with a number greater than 3. G = {B4, B5}. 
P(G) = 2/8. Let H = blue card numbered between 1 and 4, inclusive. 
H = {B1, B2, B3, B4}. P(G|H) = 1/4. (The only card in H that has 
a number greater than 3 is B4.) Since 2/8 = 1/4, P(G) = P(G|H), 
which means that G and H are independent. 
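
The card example can be reproduced by listing the eight cards in a data frame and counting. This is a sketch; the column names color and number are chosen here for illustration.

```r
# One row per card: R1, R2, R3, B1, B2, B3, B4, B5
cards <- data.frame(
  color  = c(rep("red", 3), rep("blue", 5)),
  number = c(1:3, 1:5)
)

R <- cards$color == "red"     # red card
B <- cards$color == "blue"    # blue card
E <- cards$number %% 2 == 0   # even-numbered card

p_E         <- mean(E)             # 3/8
p_E_given_B <- sum(E & B) / sum(B) # 2/5
p_B_given_E <- sum(E & B) / sum(E) # 2/3
p_R_and_B   <- mean(R & B)         # 0, so R and B are mutually exclusive

G <- cards$number > 3              # number greater than 3: B4, B5
H <- B & cards$number <= 4         # blue card numbered 1 to 4

p_G         <- mean(G)             # 2/8
p_G_given_H <- sum(G & H) / sum(H) # 1/4, equal to P(G): independent
```

Because every card is equally likely, the mean of a TRUE/FALSE vector gives the probability of the corresponding event.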


Example: 

In a particular college class, 60% of the students are female. 50% of all 
students in the class have long hair. 45% of the students are female and 
have long hair. Of the female students, 75% have long hair. Let F be the 
event that the student is female. Let L be the event that the student has long 
hair. One student is picked randomly. Are the events of being female and 
having long hair independent? 


• The following probabilities are given in this example: 
• P(F) = 0.60; P(L) = 0.50 
• P(F AND L) = 0.45 
• P(L|F) = 0.75 


Note: The choice you make depends on the information you have. You 
could use the first or last condition on the list for this example. You do not 
know P(F|L) yet, so you cannot use the second condition. 


Solution 1 

Check whether P(F AND L) = P(F) · P(L). We are given that P(F AND L) = 0.45, 
but P(F) · P(L) = (0.60)(0.50) = 0.30. The events of being female and having 
long hair are not independent because P(F AND L) does not equal P(F) · P(L). 

Solution 2 

Check whether P(L|F) equals P(L). We are given that P(L|F) = 0.75 but 
P(L) = 0.50; they are not equal. The events of being female and having 
long hair are not independent. 

Interpretation of Results 

The events of being female and having long hair are not independent; 
knowing that a student is female changes the probability that a student has 
long hair. 


**Example 5 contributed by Roberta Bloom 


Glossary 


Independent Events 
The occurrence of one event has no effect on the probability of the 
occurrence of any other event. Events A and B are independent if one 
of the following is true: (1) P(A|B) = P(A); (2) P(B|A) = P(B); 
(3) P(A AND B) = P(A) · P(B). 


Mutually Exclusive 
An observation cannot fall into more than one class (category). Being 
in more than one category prevents being in a mutually exclusive 
category. 


Two Basic Rules of Probability 
This module introduces the multiplication and addition rules used when calculating 
probabilities. 


The Multiplication Rule 


If A and B are two events defined on a sample space, then: 
P(A AND B) = P(B) · P(A|B). 


This rule may also be written as: P(A|B) = P(A AND B) / P(B) 
(The probability of A given B equals the probability of A and B divided by the 
probability of B.) 


If A and B are independent, then P(A|B) = P(A). Then 
P(A AND B) = P(A|B) P(B) becomes P(A AND B) = P(A) P(B). 


The Addition Rule 


If A and B are defined on a sample space, then: 
P(A OR B) = P(A) + P(B) - P(A AND B). 


If A and B are mutually exclusive, then P(A AND B) = 0. Then 
P(A OR B) = P(A) + P(B) - P(A AND B) becomes 
P(A OR B) = P(A) + P(B). 


Example: 
Klaus is trying to choose where to go on vacation. His two choices are: A = New 
Zealand and B = Alaska. 


• Klaus can only afford one vacation. The probability that he chooses A is 
P(A) = 0.6 and the probability that he chooses B is P(B) = 0.35. 

• P(A AND B) = 0 because Klaus can only afford to take one vacation. 

• Therefore, the probability that he chooses either New Zealand or Alaska is 
P(A OR B) = P(A) + P(B) = 0.6 + 0.35 = 0.95. Note that the probability 
that he does not choose to go anywhere on vacation must be 0.05. 


Example: 


Carlos plays college soccer. He makes a goal 65% of the time he shoots. Carlos is going 
to attempt two goals in a row in the next game. 

A = the event Carlos is successful on his first attempt. P(A) = 0.65. B = the event 
Carlos is successful on his second attempt. P(B) = 0.65. Carlos tends to shoot in 
streaks. The probability that he makes the second goal GIVEN that he made the first 
goal is 0.90. 

Exercise: 


Problem: What is the probability that he makes both goals? 


Solution: 


The problem is asking you to find P(A AND B) = P(B AND A). Since 
P(B|A) = 0.90: 
Equation: 


P(B AND A) = P(B|A) · P(A) = 0.90 · 0.65 = 0.585 


Carlos makes the first and second goals with probability 0.585. 
Exercise: 


Problem: 
What is the probability that Carlos makes either the first goal or the second goal? 
Solution: 


The problem is asking you to find P(A OR B). 
Equation: 


P(A OR B) = P(A) + P(B) - P(A AND B) = 0.65 + 0.65 - 0.585 = 0.715 


Carlos makes either the first goal or the second goal with probability 0.715. 
Exercise: 


Problem: Are A and B independent? 


Solution: 


No, they are not, because P(B AND A) = 0.585. 
Equation: 


P(B) · P(A) = (0.65) · (0.65) = 0.4225 ≈ 0.423 
Equation: 


0.423 ≠ 0.585 = P(B AND A) 


So, P(B AND A) is not equal to P(B) · P(A). 


Exercise: 


Problem: Are A and B mutually exclusive? 


Solution: 
No, they are not because P(A and B) = 0.585. 


To be mutually exclusive, P(A AND B) must equal 0. 
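
The four answers in the Carlos example follow directly from the multiplication and addition rules. A minimal sketch in R, with variable names chosen here for illustration:

```r
p_A         <- 0.65  # P(A): success on the first attempt
p_B         <- 0.65  # P(B): success on the second attempt
p_B_given_A <- 0.90  # P(B|A): success on the second, given success on the first

# Multiplication rule: P(A AND B) = P(B|A) * P(A)
p_A_and_B <- p_B_given_A * p_A      # 0.585

# Addition rule: P(A OR B) = P(A) + P(B) - P(A AND B)
p_A_or_B <- p_A + p_B - p_A_and_B   # 0.715

# Independent only if P(A AND B) equals P(A) * P(B) = 0.4225
independent <- isTRUE(all.equal(p_A_and_B, p_A * p_B))  # FALSE

# Mutually exclusive only if P(A AND B) is 0
mutually_exclusive <- p_A_and_B == 0                    # FALSE
```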


Example: 

A community swim team has 150 members. Seventy-five of the members are advanced 
swimmers. Forty-seven of the members are intermediate swimmers. The remainder are 
novice swimmers. Forty of the advanced swimmers practice 4 times a week. Thirty of 
the intermediate swimmers practice 4 times a week. Ten of the novice swimmers 
practice 4 times a week. Suppose one member of the swim team is randomly chosen. 
Answer the questions (Verify the answers): 

Exercise: 


Problem: What is the probability that the member is a novice swimmer? 


Solution: 


28/150 (there are 150 - 75 - 47 = 28 novice swimmers) 


Exercise: 


Problem: What is the probability that the member practices 4 times a week? 


Solution: 


80/150 (40 + 30 + 10 = 80 members practice 4 times a week) 


Exercise: 
Problem: 


What is the probability that the member is an advanced swimmer and practices 4 
times a week? 


Solution: 


40/150 


Exercise: 
Problem: 
What is the probability that a member is an advanced swimmer and an 
intermediate swimmer? Are being an advanced swimmer and an intermediate 
swimmer mutually exclusive? Why or why not? 


Solution: 


P(advanced AND intermediate) = 0, so these are mutually exclusive events. 
A swimmer cannot be an advanced swimmer and an intermediate swimmer at the 
same time. 


Exercise: 


Problem: 


Are being a novice swimmer and practicing 4 times a week independent events? 
Why or why not? 


Solution: 


No, these are not independent events. 


Equation: 

P(novice AND practices 4 times per week) = 10/150 = 0.0667 
Equation: 

P(novice) · P(practices 4 times per week) = (28/150) · (80/150) = 0.0996 
Equation: 


0.0667 ≠ 0.0996 
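
The swim team answers can be verified by entering the counts as a matrix. In this sketch the "other" column is not stated in the example; it is derived here by subtraction (75 - 40, 47 - 30, 28 - 10), and the row and column labels are chosen for illustration.

```r
swim <- matrix(c(40, 35,
                 30, 17,
                 10, 18),
               nrow = 3, byrow = TRUE,
               dimnames = list(level    = c("advanced", "intermediate", "novice"),
                               practice = c("4 times a week", "other")))

n <- sum(swim)  # 150 members

p_novice       <- sum(swim["novice", ]) / n               # 28/150
p_four         <- sum(swim[, "4 times a week"]) / n       # 80/150
p_novice_and_4 <- swim["novice", "4 times a week"] / n    # 10/150

# Independence would require these two numbers to be equal
round(p_novice_and_4, 4)     # 0.0667
round(p_novice * p_four, 4)  # 0.0996
```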


Example: 

Studies show that, if she lives to be 90, about 1 woman in 7 (approximately 14.3%) will 
develop breast cancer. Suppose that of those women who develop breast cancer, a test is 
negative 2% of the time. Also suppose that in the general population of women, the test 
for breast cancer is negative about 85% of the time. Let B = woman develops breast 
cancer and let N = tests negative. Suppose one woman is selected at random. 

Exercise: 


Problem: 


What is the probability that the woman develops breast cancer? What is the 
probability that the woman tests negative? 


Solution: 


P(B) = 0.143; P(N) = 0.85 
Exercise: 


Problem: 


Given that the woman has breast cancer, what is the probability that she tests 
negative? 


Solution: 


P(N|B) = 0.02 
Exercise: 


Problem: 
What is the probability that the woman has breast cancer AND tests negative? 
Solution: 


P(B AND N) = P(B) · P(N|B) = (0.143) · (0.02) = 0.0029 
Exercise: 


Problem: 
What is the probability that the woman has breast cancer or tests negative? 


Solution: 


P(B OR N) = P(B) + P(N) - P(B AND N) = 0.143 + 0.85 - 0.0029 = 0.9901 


Exercise: 


Problem: Are having breast cancer and testing negative independent events? 


Solution: 


No. P(N) = 0.85; P(N|B) = 0.02. So, P(N|B) does not equal P(N) 


Exercise: 


Problem: Are having breast cancer and testing negative mutually exclusive? 


Solution: 


No. P(B AND N) = 0.0029. For B and N to be mutually exclusive, 
P(B AND N) must be 0. 


Glossary 


Independent Events 
The occurrence of one event has no effect on the probability of the occurrence of 
any other event. Events A and B are independent if one of the following is true: 
(1) P(A|B) = P(A); (2) P(B|A) = P(B); (3) P(A AND B) = P(A) · P(B). 


Mutually Exclusive 
An observation cannot fall into more than one class (category). Being in more than 
one category prevents being in a mutually exclusive category. 


Sample Space 
The set of all possible outcomes of an experiment. 


Contingency Tables 
This module introduces the contingency table as a way of determining conditional 
probabilities. 


A contingency table provides a way of portraying data that can facilitate calculating 
probabilities. The table helps in determining conditional probabilities quite easily. The 
table displays sample values in relation to two different variables that may be dependent 
or contingent on one another. Later on, we will use contingency tables again, but in 
another manner. 


Example: 
Suppose a study of speeding violations and drivers who use car phones produced the 
following fictional data: 


                         Speeding violation    No speeding violation 
                         in the last year      in the last year         Total 
Car phone user                  25                    280                305 
Not a car phone user            45                    405                450 
Total                           70                    685                755 


The total number of people in the sample is 755. The row totals are 305 and 450. The 
column totals are 70 and 685. Notice that 305 + 450 = 755 and 70 + 685 = 755. 
Calculate the following probabilities using the table 

Exercise: 


Problem: P(person is a car phone user) = 


Solution: 


(number of car phone users) / (total number in study) = 305/755 


Exercise: 


Problem: P(person had no violation in the last year) = 


Solution: 


(number that had no violation) / (total number in study) = 685/755 


Exercise: 


Problem: 
P(person had no violation in the last year AND was a car phone user) = 


Solution: 


280/755 


Exercise: 


Problem: 


P(person is a car phone user OR person had no violation in the last year) = 


Solution: 

(305/755 + 685/755) - 280/755 = 710/755 
Exercise: 

Problem: 


P(person is a car phone user GIVEN person had a violation in the last year) = 
Solution: 


25/70 (The sample space is reduced to the number of persons who had a violation.) 
Exercise: 


Problem: 
P(person had no violation last year GIVEN person was not a car phone user) = 


Solution: 


405/450 (The sample space is reduced to the number of persons who were not car phone 
users.) 
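
The six answers above can be checked by entering the fictional counts as a matrix in R. This is a sketch; the row and column labels are chosen here for illustration.

```r
phone <- matrix(c(25, 280,
                  45, 405),
                nrow = 2, byrow = TRUE,
                dimnames = list(user      = c("car phone user", "not a car phone user"),
                                violation = c("violation", "no violation")))

n <- sum(phone)  # 755 people in the sample

p_user    <- sum(phone["car phone user", ]) / n            # 305/755
p_no_viol <- sum(phone[, "no violation"]) / n              # 685/755
p_both    <- phone["car phone user", "no violation"] / n   # 280/755
p_either  <- p_user + p_no_viol - p_both                   # 710/755

# Conditional probabilities divide by a row or column total, not by n
p_user_given_viol <- phone["car phone user", "violation"] /
  sum(phone[, "violation"])                                # 25/70
p_no_viol_given_nonuser <- phone["not a car phone user", "no violation"] /
  sum(phone["not a car phone user", ])                     # 405/450
```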


Example: 
The following table shows a random sample of 100 hikers and the areas of hiking 
preferred: 


Sex       The Coastline   Near Lakes and Streams   On Mountain Peaks   Total 
Female         18                  16                     ___             45 
Male           ___                 ___                    14              55 
Total          ___                 41                     ___             ___ 


Hiking Area Preference 


Exercise: 


Problem: Complete the table. 


Solution: 
Sex       The Coastline   Near Lakes and Streams   On Mountain Peaks   Total 
Female         18                  16                     11              45 
Male           16                  25                     14              55 
Total          34                  41                     25             100 


Hiking Area Preference 


Exercise: 


Problem: 
Are the events "being female" and "preferring the coastline" independent events? 
Let F = being female and let C = preferring the coastline. 


• a. P(F AND C) = 
• b. P(F) · P(C) = 


Are these two numbers the same? If they are, then F and C are independent. If they 
are not, then F and C are not independent. 


Solution: 
• a. P(F AND C) = 18/100 = 0.18 
• b. P(F) · P(C) = (45/100) · (34/100) = 0.45 · 0.34 = 0.153 


P(F AND C) ≠ P(F) · P(C), so the events F and C are not independent. 
Exercise: 

Problem: 

Find the probability that a person is male given that the person prefers hiking near 
lakes and streams. Let M = being male and let L = prefers hiking near lakes and 
streams. 


• a. What word tells you this is a conditional? 
• b. Fill in the blanks and calculate the probability: P(__|__) = 
• c. Is the sample space for this problem all 100 hikers? If not, what is it? 


Solution: 
• a. The word 'given' tells you that this is a conditional. 
• b. P(M|L) = 25/41 
• c. No, the sample space for this problem is the 41 hikers who prefer lakes and streams. 


Exercise: 


Problem: 


Find the probability that a person is female or prefers hiking on mountain peaks. 
Let F = being female and let P = prefers mountain peaks. 


• a. P(F) = 
• b. P(P) = 
• c. P(F AND P) = 
• d. Therefore, P(F OR P) = 


Solution: 
• a. P(F) = 45/100 
• b. P(P) = 25/100 
• c. P(F AND P) = 11/100 
• d. P(F OR P) = 45/100 + 25/100 - 11/100 = 59/100 

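
The hiking exercises can be reproduced in R from the completed table. The sketch below enters the six cell counts, lets addmargins() compute the totals, and then repeats the three calculations; the labels are chosen here for illustration.

```r
hikers <- as.table(matrix(c(18, 16, 11,
                            16, 25, 14),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(sex  = c("Female", "Male"),
                                          area = c("Coastline", "Lakes and Streams",
                                                   "Mountain Peaks"))))

addmargins(hikers)  # shows the row, column, and grand totals
n <- sum(hikers)    # 100 hikers

# Independence of "female" and "prefers the coastline"
p_F       <- sum(hikers["Female", ]) / n         # 0.45
p_C       <- sum(hikers[, "Coastline"]) / n      # 0.34
p_F_and_C <- hikers["Female", "Coastline"] / n   # 0.18, not equal to 0.45 * 0.34

# P(M|L): the sample space is reduced to the 41 lakes-and-streams hikers
p_M_given_L <- hikers["Male", "Lakes and Streams"] /
  sum(hikers[, "Lakes and Streams"])             # 25/41

# Addition rule for "female OR prefers mountain peaks"
p_P       <- sum(hikers[, "Mountain Peaks"]) / n
p_F_and_P <- hikers["Female", "Mountain Peaks"] / n
p_F_or_P  <- p_F + p_P - p_F_and_P               # 0.59
```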
Example: 


Muddy Mouse lives in a cage with 3 doors. If Muddy goes out the first door, the 
probability that he gets caught by Alissa the cat is 1/5 and the probability he is not caught 
is 4/5. If he goes out the second door, the probability he gets caught by Alissa is 1/4 and 
the probability he is not caught is 3/4. The probability that Alissa catches Muddy coming 
out of the third door is 1/2 and the probability she does not catch Muddy is 1/2. It is 
equally likely that Muddy will choose any of the three doors so the probability of 
choosing each door is 1/3. 


Caught or Not   Door One   Door Two   Door Three   Total 
Caught            1/15       1/12        1/6        ___ 
Not Caught        4/15       3/12        1/6        ___ 
Total             ___        ___         ___        ___ 


Door Choice 


• The first entry 1/15 = (1/5)(1/3) is P(Door One AND Caught). 
• The entry 4/15 = (4/5)(1/3) is P(Door One AND Not Caught). 


Verify the remaining entries. 
Exercise: 


Problem: 


Complete the probability contingency table. Calculate the entries for the totals. 
Verify that the lower-right corner entry is 1. 


Solution: 


Caught or Not   Door One   Door Two   Door Three   Total 
Caught            1/15       1/12        1/6       19/60 
Not Caught        4/15       3/12        1/6       41/60 
Total             5/15       4/12        2/6         1 


Door Choice 


Exercise: 


Problem: What is the probability that Alissa does not catch Muddy? 


Solution: 


41/60 


Exercise: 


Problem: 


What is the probability that Muddy chooses Door One OR Door Two given that 
Muddy is caught by Alissa? 


Solution: 


P(Door One OR Door Two | Caught) = (1/15 + 1/12) / (19/60) = (9/60) / (19/60) = 9/19 


Note: You could also do this problem by using a probability tree. See the Tree Diagrams 
(Optional) section of this chapter for examples. 
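
The Muddy Mouse table can also be built in R by multiplying each door's probability by the conditional probability of being caught at that door. This is a sketch: the conditional values 1/5, 1/4, and 1/2 are the ones consistent with the table entries 1/15, 1/12, and 1/6, and the object names are chosen here for illustration.

```r
p_door              <- c(one = 1/3, two = 1/3, three = 1/3)  # P(door)
p_caught_given_door <- c(one = 1/5, two = 1/4, three = 1/2)  # P(Caught | door)

# Joint probabilities: P(door AND Caught) and P(door AND Not Caught)
joint <- rbind("Caught"     = p_door * p_caught_given_door,
               "Not Caught" = p_door * (1 - p_caught_given_door))

p_caught     <- sum(joint["Caught", ])      # 19/60
p_not_caught <- sum(joint["Not Caught", ])  # 41/60
sum(joint)                                  # lower-right corner entry: 1

# P(Door One OR Door Two | Caught): restrict to the "Caught" row
p_one_or_two_given_caught <- sum(joint["Caught", c("one", "two")]) / p_caught  # 9/19
```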


Glossary 


Contingency Table 
The method of displaying a frequency distribution as a table with rows and columns 
to show how two variables may be dependent (contingent) upon each other. The 
table provides an easy way to calculate conditional probabilities. 


Frequency and Contingency Tables in R 
A brief module demonstrating the table, prop.table, and related functions in 
R to complement the content found in Collaborative Statistics. 


The examples in the previous section provided you with contingency tables 
and asked you to calculate row, column, and cell percentages and totals. In 
real practice, you will most likely be dealing with raw data that needs to 
first be summarized before creating the contingency tables. This section 
will introduce some of the many methods that can be used in R to calculate 
frequencies and create contingency tables. 


Here, we will generate some sample data and view the first and last few 
lines of the dataset. In reality, you would probably have additional variables 
in the dataset, in which case when you are creating your tables, you will 
have to specify the columns to tabulate. 


set.seed(1)
myDF = data.frame(car.phone.use = sample(c(TRUE, FALSE), 755,
                                         replace=TRUE, prob=c(.4, .6)),
                  speed.violation = sample(c(TRUE, FALSE), 755,
                                           replace=TRUE, prob=c(.1, .9)))
# First few cases
head(myDF)

##   car.phone.use speed.violation
## 1         FALSE           FALSE
## 2         FALSE           FALSE
## 3         FALSE           FALSE
## 4          TRUE            TRUE
## 5         FALSE           FALSE
## 6          TRUE           FALSE


# Last few cases
tail(myDF)

##     car.phone.use speed.violation
## 750          TRUE           FALSE
## 751         FALSE           FALSE
## 752          TRUE           FALSE
## 753         FALSE           FALSE
## 754         FALSE           FALSE
## 755         FALSE           FALSE


R has several useful built-in functions for tabulation and contingency tables. 
In particular, the functions table() and prop.table() are a good 
starting point. The table() function will create a basic cross table of the 
specified variables. The prop.table() function takes a table (or matrix) as its 
input and is usually used to return cell percentages (no second argument), 
row percentages (1 as the second argument), or column percentages (2 as 
the second argument), each expressed as a proportion between 0 and 1. 


# Simple tabulation of the two variables
myTable = table(myDF)
myTable

##              speed.violation
## car.phone.use FALSE TRUE
##         FALSE   411   48
##         TRUE    264   32
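
When the data frame holds more variables than the two of interest, the columns to tabulate have to be named explicitly. The sketch below regenerates the sample data, adds a hypothetical third column (driver.age, invented here purely for illustration), and shows two common ways of selecting the columns.

```r
set.seed(1)
myDF <- data.frame(car.phone.use = sample(c(TRUE, FALSE), 755,
                                          replace = TRUE, prob = c(.4, .6)),
                   speed.violation = sample(c(TRUE, FALSE), 755,
                                            replace = TRUE, prob = c(.1, .9)))

# Hypothetical extra variable, not part of the original example
myDF$driver.age <- sample(18:80, 755, replace = TRUE)

# Option 1: pass only the columns of interest to table()
tab1 <- table(myDF[, c("car.phone.use", "speed.violation")])

# Option 2: use the formula interface of xtabs()
tab2 <- xtabs(~ car.phone.use + speed.violation, data = myDF)

tab1
tab2
```

Both calls produce the same 2 x 2 table of counts. The exact counts depend on the R version, because the default behaviour of sample() changed in R 3.6.0.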


# Adding row and column sums
addmargins(myTable)

##              speed.violation
## car.phone.use FALSE TRUE Sum
##         FALSE   411   48 459
##         TRUE    264   32 296
##         Sum     675   80 755


# Cell percentages of total
prop.table(myTable)

##              speed.violation
## car.phone.use   FALSE    TRUE
##         FALSE 0.54437 0.06358
##         TRUE  0.34967 0.04238


# Row percentages
prop.table(myTable, 1)

##              speed.violation
## car.phone.use  FALSE   TRUE
##         FALSE 0.8954 0.1046
##         TRUE  0.8919 0.1081


# Column percentages
prop.table(myTable, 2)

##              speed.violation
## car.phone.use  FALSE   TRUE
##         FALSE 0.6089 0.6000
##         TRUE  0.3911 0.4000
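
Because prop.table() returns proportions, one further step is needed to show them as percentages. The following is a small sketch using the counts from the table above, entered by hand so that the block runs on its own.

```r
# Counts copied from the table shown above (the matrix is filled column by column)
myTable <- as.table(matrix(c(411, 264, 48, 32), nrow = 2,
                           dimnames = list(car.phone.use = c("FALSE", "TRUE"),
                                           speed.violation = c("FALSE", "TRUE"))))

# Row percentages, rounded to one decimal place
row.pct <- round(100 * prop.table(myTable, 1), 1)
row.pct

# Cell proportions with row and column sums added;
# the lower-right corner is 1
cell.prop <- addmargins(prop.table(myTable))
cell.prop
```

Wrapping prop.table() in addmargins() gives the same layout as the probability contingency tables in the previous section, with the totals in the last row and column.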


An alternative to creating these tables separately is to use the 
CrossTable() function from the "gmodels" package. Install the package 
with install.packages("gmodels") (only required once) and load it 
with library(gmodels) (required once per R session). 


# Uncomment the following to install the
# "gmodels" package if not yet installed.
# install.packages("gmodels")
library(gmodels)
CrossTable(myTable)

##
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  755
##
## [Cross table of car.phone.use (rows) by speed.violation (columns FALSE and TRUE), with row totals and a column total of 755; the cell values are not preserved in this copy.]
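
Each cell of the CrossTable() output lists N, the chi-square contribution, and the row, column, and table proportions. The proportions are the prop.table() results shown earlier. The chi-square contribution can be approximated in base R as sketched below; this is an illustration under our own assumption that the contribution is (observed - expected)^2 / expected, with expected counts computed under independence, and it is not a reproduction of the internals of gmodels. The counts are entered by hand from the earlier table.

```r
myTable <- as.table(matrix(c(411, 264, 48, 32), nrow = 2,
                           dimnames = list(car.phone.use = c("FALSE", "TRUE"),
                                           speed.violation = c("FALSE", "TRUE"))))

# Expected counts under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(myTable), colSums(myTable)) / sum(myTable)

# Per-cell contribution to the chi-square statistic
chisq.contrib <- (myTable - expected)^2 / expected
round(chisq.contrib, 3)

# The contributions add up to the Pearson chi-square statistic
# computed without the continuity correction
sum(chisq.contrib)
chisq.test(myTable, correct = FALSE)$statistic
```

Small contributions in every cell indicate that the observed counts are close to what independence between the two variables would predict.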


Practice 1: Contingency Tables 

This module provides the opportunity for students to apply what they've learned about probability to 
solve a series of problems given a set of data. Students will practice constructing and interpreting 
contingency tables. 


Student Learning Outcomes 


• The student will construct and interpret contingency tables. 


Given 

An article in the New England Journal of Medicine reported on a study of smokers in California 
and Hawaii. In one part of the report, the self-reported ethnicity and smoking levels per day were 
given. Of the people smoking at most 10 cigarettes per day, there were 9886 African Americans, 2745 
Native Hawaiians, 12,831 Latinos, 8378 Japanese Americans, and 7650 Whites. Of the people 
smoking 11-20 cigarettes per day, there were 6514 African Americans, 3062 Native Hawaiians, 4932 
Latinos, 10,680 Japanese Americans, and 9877 Whites. Of the people smoking 21-30 cigarettes per 
day, there were 1671 African Americans, 1419 Native Hawaiians, 1406 Latinos, 4715 Japanese 
Americans, and 6062 Whites. Of the people smoking at least 31 cigarettes per day, there were 759 
African Americans, 788 Native Hawaiians, 800 Latinos, 2305 Japanese Americans, and 3970 Whites. 
(Source: http://www.nejm.org/doi/full/10.1056/NEJMoa033250) 


Complete the Table 


Complete the table below using the data provided. 


Smoking Level    African American    Native Hawaiian    Latino    Japanese Americans    White    TOTALS

1-10
11-20
21-30
31+
TOTALS


Smoking Levels by Ethnicity 


Analyze the Data 


Suppose that one person from the study is randomly selected. 


Exercise: 


Problem: Find the probability that the person smoked 11-20 cigarettes per day. 


Solution: 


35,065/100,450 


Exercise: 


Problem: Find the probability that the person was Latino. 


Solution: 


19,969/100,450 


Discussion Questions 


Exercise: 


Problem: 


In words, explain what it means to pick one person from the study and that person is “Japanese 
American AND smokes 21-30 cigarettes per day.” Also, find the probability. 
Solution: 


4,715/100,450 


Exercise: 


Problem: 


In words, explain what it means to pick one person from the study and that person is “Japanese 
American OR smokes 21-30 cigarettes per day.” Also, find the probability. 


Solution: 


36,636/100,450 


Exercise: 


Problem: 


In words, explain what it means to pick one person from the study and that person is “Japanese 
American GIVEN that person smokes 21-30 cigarettes per day.” Also, find the probability. 


Solution: 


4,715/15,273 
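
The solutions above can be checked in R. The sketch below enters the counts from the Given paragraph as a matrix and reproduces the AND, OR, and GIVEN probabilities for the events "Japanese American" and "smokes 21-30 cigarettes per day". The object name smoke and the shortened column labels are ours.

```r
# Counts from the "Given" paragraph:
# rows are smoking levels, columns are ethnicities
smoke <- matrix(c(9886, 2745, 12831,  8378, 7650,
                  6514, 3062,  4932, 10680, 9877,
                  1671, 1419,  1406,  4715, 6062,
                   759,  788,   800,  2305, 3970),
                nrow = 4, byrow = TRUE,
                dimnames = list(level = c("1-10", "11-20", "21-30", "31+"),
                                ethnicity = c("AfrAm", "NatHaw", "Latino",
                                              "JapAm", "White")))

n <- sum(smoke)   # total number of people in the study

# AND: both events occur, so use the single cell
p.and <- smoke["21-30", "JapAm"] / n

# OR: add the column and row totals, then subtract the cell counted twice
p.or <- (sum(smoke[, "JapAm"]) + sum(smoke["21-30", ]) -
         smoke["21-30", "JapAm"]) / n

# GIVEN: restrict the sample space to the 21-30 row
p.given <- smoke["21-30", "JapAm"] / sum(smoke["21-30", ])

c(AND = p.and, OR = p.or, GIVEN = p.given)
```

Note that the three probabilities share the same numerator cell but differ in what is added to it (OR) or in the denominator (GIVEN).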


Exercise: 


Problem: Prove that smoking level/day and ethnicity are dependent events. 


Practice 2: Calculating Probabilities 

This module allows students to practice using what they've learned about probability. Students will apply their 
understanding of basic probability terms, calculate probabilities based on the data provided, and determine whether 
events are independent or mutually exclusive. 


Student Learning Outcomes 
• Students will define basic probability terms. 
• Students will calculate probabilities. 
• Students will determine whether two events are mutually exclusive or whether two events are independent. 


Note: Use probability rules to solve the problems below. Show your work. 


Given 

48% of all California registered voters prefer life in prison without parole over the death penalty for a person 
convicted of first degree murder. Among Latino California registered voters, 55% prefer life in prison without 
parole over the death penalty for a person convicted of first degree murder. (Source: 
http://field.com/fieldpollonline/subscribers/RIs2393.pdf). 

37.6% of all Californians are Latino (Source: U.S. Census Bureau). 

In this problem, let: 


• C = Californians (registered voters) preferring life in prison without parole over the death penalty for a person convicted of first degree murder 
• L = Latino Californians 


Suppose that one Californian is randomly selected. 


Analyze the Data 
Exercise: 
Problem: P(C) = 
Solution: 


0.48 


Exercise: 


Problem: P(L) = 


Solution: 


0.376 


Exercise: 
Problem: P(C|L) = 


Solution: 


0.55 


Exercise: 


Problem: In words, what is “C|L”? 


Exercise: 


Problem: P(L AND C) = 


Solution: 


0.2068 


Exercise: 


Problem: In words, what is “L AND C”? 


Exercise: 


Problem: Are L and C independent events? Show why or why not. 


Solution: 
No 


Exercise: 


Problem: P(L OR C) = 


Solution: 
0.6492 


Exercise: 


Problem: In words, what is “L OR C”? 


Exercise: 


Problem: Are L and C mutually exclusive events? Show why or why not. 


Solution: 


No 
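
The numeric answers in this practice follow from the multiplication and addition rules. A brief sketch in R, using only the three given values (the variable names are ours):

```r
p.C   <- 0.48    # P(C), given
p.L   <- 0.376   # P(L), given
p.C.L <- 0.55    # P(C|L), given

# Multiplication rule: P(L AND C) = P(C|L) * P(L)
p.and <- p.C.L * p.L

# Addition rule: P(L OR C) = P(L) + P(C) - P(L AND C)
p.or <- p.L + p.C - p.and

# L and C are independent only if P(C|L) equals P(C)
independent <- isTRUE(all.equal(p.C.L, p.C))

# L and C are mutually exclusive only if P(L AND C) equals 0
mutually.exclusive <- isTRUE(all.equal(p.and, 0))

c(p.and = p.and, p.or = p.or)
c(independent = independent, mutually.exclusive = mutually.exclusive)
```

Since P(C|L) = 0.55 differs from P(C) = 0.48, the events are not independent, and since P(L AND C) is greater than 0, they are not mutually exclusive.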


